action #40670
closed[sle][functional][u][medium][tools][sporadic] test fails in welcome - typos in kernel boot parameters
0%
Description
Observation
The URL is not typed correctly in the boot options. A mutex_wait follows right after typing it, and VERY_SLOW_TYPING_SPEED is already used.
Sample from autoinst-log.txt
[2018-10-10T10:24:10.0807 CEST] [debug] <<< testapi::type_string(string=' regurl=http://Server-0421.proxy.scc.suse.de', max_interval=4, wait_screen_changes=0, wait_still_screen=0)
Sample from YaST2/y2log
2018-10-10 04:26:33 <1> install(3575) [Ruby] registration/registration.rb:277 Using custom registration URL: "http://Server-0421.proxy.scc.suse.e"
(Note: interesting date-time difference)
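To pin down exactly which characters went missing, the string sent by type_string can be compared with the string that ended up in y2log. The following is only a minimal sketch using Python's standard difflib; the helper name missing_chars is hypothetical and not part of os-autoinst or openQA.

```python
import difflib


def missing_chars(typed: str, received: str) -> list[tuple[int, str]]:
    """Return (index_in_typed, text) for each span of `typed` not found in `received`."""
    matcher = difflib.SequenceMatcher(a=typed, b=received, autojunk=False)
    dropped = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete" means the span was lost, "replace" means it was altered.
        if tag in ("delete", "replace"):
            dropped.append((i1, typed[i1:i2]))
    return dropped


if __name__ == "__main__":
    # Strings taken from the autoinst-log.txt and y2log samples above.
    typed = "http://Server-0421.proxy.scc.suse.de"
    received = "http://Server-0421.proxy.scc.suse.e"
    for index, text in missing_chars(typed, received):
        print(f"lost {text!r} at position {index}")
```

Running the same comparison over many failed jobs would show whether the lost characters cluster at particular positions or keys.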
Reproducible
On PPC, it has failed once since Build 41.4.
On x86_64, it failed at least 6 times in the last 100 runs.
Acceptance criteria
- AC1: The reason why typed characters are missing is known. fulfilled
- AC2: Workers on OSD use at least qemu-2.11.2
- AC3: It is not possible to reproduce the issue on 100 runs.
Suggestions
Hypotheses
- H1.1 It is caused by os-autoinst backend (qemu). SUPPORTED BY E1.1-2
- H2.1 It is not worker specific. SUPPORTED BY E1.1-1
- H3.1 Making os-autoinst type even slower doesn't help. SUPPORTED BY E3.1-1
- H3.2 Making os-autoinst type even slower doesn't solve the issue. SUPPORTED BY E3.1-1
- H4.1 Product changes cause this issue. SUPPORTED BY E4.1-1
Experiments
- E1.1-1 Perform statistical investigation with current qemu version (qemu 2.9.1).
- R1.1-1 2018-09-24 100 job runs
- 85 of 100 jobs failed. (The jobs are gone, so the logs cannot be inspected.)
- Only the space is missing (...linuxYDEBUG=1 instead of ...linux YDEBUG=1);
- The space and the "Y" character are missing (...linuxDEBUG=1 instead of ...linux YDEBUG=1);
- R1.1-1 2018-10-17 101 job runs
- 12/101 fail (fewer failures than the last time the experiment was conducted; presumably a qemu update released in the meantime caused the improvement)
- QEMU emulator version 2.9.1 (openSUSE Leap 42.3)
- It fails on different workers.
- It doesn't fail always on the same workers.
- E1.1-2 Perform statistical investigation with a different version of qemu (2.11.2)
- R1.1-2 2018-10-17 101 job runs
- 0/101 fail
- QEMU emulator version 2.11.2 (openSUSE Leap 15.0)
- E3.1-1 Try 100 job runs with slower typing.
- use constant EXTREME_SLOW_TYPING_SPEED => 2;
- registration_bootloader_params(utils::EXTREME_SLOW_TYPING_SPEED);
- R3.1-1 2018-10-18 101 job runs
- 16/101 fail
- It is still failing
- E4.1-1 Try with an older version of the product (SLE15-GM-668.1).
- R4.1-1 2018-10-19 101 job runs: https://openqa.suse.de/tests/overview?build=misstyped-linuxrc-poo40670-C
- 0/101 fail
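The run counts in the experiments above can be put into perspective with a little arithmetic. The sketch below assumes that every job run is independent and fails with a fixed probability; that is a simplifying assumption, not something the experiments establish. Under it, if qemu 2.11.2 still had the 12/101 failure rate measured with qemu 2.9.1, the chance of seeing 0/101 failures purely by luck would be well below 1 in 100,000. The function name is hypothetical.

```python
def prob_no_failures(failure_rate: float, runs: int) -> float:
    """Probability that `runs` independent runs all pass, given a per-run failure rate."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    return (1.0 - failure_rate) ** runs


if __name__ == "__main__":
    baseline = 12 / 101  # R1.1-1 on 2018-10-17 with qemu 2.9.1
    # Chance that 101 runs all pass if the baseline rate still applied (E1.1-2, E4.1-1).
    print(f"{prob_no_failures(baseline, 101):.2e}")
```

The same reasoning applies to E4.1-1: 0/101 failures with the older product is very unlikely to be a lucky streak if the underlying failure rate were unchanged.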
Expected result
Last good: 35.18 (or more recent)
Further details
Always latest result in these scenarios:
- sle-15-SP1-Installer-DVD-ppc64le-create_hdd_pcm_azure@ppc64le:latest
- sle-12-SP4-Server-DVD-x86_64-RAID1@64bit:bootloader
Workaround
Retrigger tests
Updated by okurz over 6 years ago
- Subject changed from [sle][functional] test fails in welcome -typing wrongly regurl to [sle][functional][y] test fails in welcome -typing wrongly regurl
- Target version set to Milestone 20
Updated by mloviska about 6 years ago
There is another typo (a missing space between Y2DEBUG and the linux location) on ppc64le: https://openqa.suse.de/tests/2052952#step/bootloader/12
Updated by mloviska about 6 years ago
- Subject changed from [sle][functional][y] test fails in welcome -typing wrongly regurl to [sle][functional][y] test fails in welcome - typos in kernel boot parameters
Updated by nicksinger about 6 years ago
All of those failed to type the regurl correctly (confirmed with logs + video):
https://openqa.suse.de/tests/2060935#step/bootloader/8
https://openqa.suse.de/tests/2060924#step/reboot_and_install/19
https://openqa.suse.de/tests/2060892#step/bootloader/8
https://openqa.suse.de/tests/2060963#step/reboot_and_install/19
Updated by coolo about 6 years ago
As thehejik pointed out: also with passwords - https://openqa.suse.de/tests/2061324#step/bootloader/17
Updated by nicksinger about 6 years ago
It also destroys each and every ppc64 test on 15SP1 because it fails to append parameters after the kernel. Example: https://openqa.suse.de/tests/2058405#step/welcome/1
The code itself shouldn't cause this problem (note the leading space in front of Y2DEBUG=1):
[2018-09-16T16:17:54.0144 CEST] [debug] <<< testapi::send_key(key='end', do_wait=0)
[2018-09-16T16:17:54.0351 CEST] [debug] waiting for screen change: 0 1000000
[2018-09-16T16:17:54.0862 CEST] [debug] waiting for screen change: 0 25.2867820069347
[2018-09-16T16:17:54.0862 CEST] [debug] >>> testapi::wait_screen_change: screen change seen at 0
[2018-09-16T16:17:54.0862 CEST] [debug] /var/lib/openqa/cache/tests/sle/tests/installation/bootloader.pm:27 called bootloader_setup::bootmenu_default_params
[2018-09-16T16:17:54.0862 CEST] [debug] <<< testapi::type_string(string=' Y2DEBUG=1 ', max_interval=4, wait_screen_changes=0, wait_still_screen=0)
Updated by okurz about 6 years ago
- Due date set to 2018-10-09
- Priority changed from Normal to High
- Target version changed from Milestone 20 to Milestone 19
Updated by okurz about 6 years ago
- Subject changed from [sle][functional][y] test fails in welcome - typos in kernel boot parameters to [sle][functional][y][fast] test fails in welcome - typos in kernel boot parameters
- Due date changed from 2018-10-09 to 2018-09-25
- Priority changed from High to Urgent
Blocks testing on ppc64le
Updated by okurz about 6 years ago
- Subject changed from [sle][functional][y][fast] test fails in welcome - typos in kernel boot parameters to [sle][functional][u][fast] test fails in welcome - typos in kernel boot parameters
- Status changed from New to In Progress
- Assignee set to zluo
Updated by okurz about 6 years ago
- Has duplicate action #41138: [sle][functional][u][fast] test fails in bootloader - openQA can't confirm kernel options added
Updated by okurz about 6 years ago
- test changes: No relevant changes, checking logs we type the correct string with still the correct speed
- product changes: We see the same also on 15SP1 so I doubt it is related to product changes
- We find passing tests on all malbec, qa-power8-5, qa-power8-4, powerqaworker-qam-1
https://openqa.suse.de/tests/2058508#step/bootloader/12 is an example where not the space before Y2DEBUG is missing but the starting character is lowercase. We assume it is in the end the same issue because AFAIK for typing it presses the shift-key and then the next characters.
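The assumption that a missing space and a lowercase first character are the same issue can be illustrated with a toy model. This is only a sketch under stated assumptions: it models an uppercase letter as shift-down, key tap, shift-up and simply drops one event; it is not how os-autoinst or qemu actually encode key events, and all names are hypothetical.

```python
def to_key_events(text: str) -> list[tuple[str, str]]:
    """Expand text into a simplified key event list (uppercase letters use shift)."""
    events = []
    for ch in text:
        if ch.isupper():
            events.append(("down", "shift"))
            events.append(("tap", ch.lower()))
            events.append(("up", "shift"))
        else:
            events.append(("tap", ch))
    return events


def render(events: list[tuple[str, str]]) -> str:
    """Replay the events and return the text the guest would see in this toy model."""
    shift = False
    out = []
    for kind, key in events:
        if key == "shift":
            shift = kind == "down"
        else:
            out.append(key.upper() if shift else key)
    return "".join(out)


def drop_event(events: list[tuple[str, str]], index: int) -> list[tuple[str, str]]:
    """Return a copy of the event list with one event lost."""
    return events[:index] + events[index + 1:]


if __name__ == "__main__":
    events = to_key_events(" Y2DEBUG=1")
    print(repr(render(drop_event(events, 0))))  # the space tap is lost
    print(repr(render(drop_event(events, 1))))  # the first shift-down is lost
```

In this model a single lost event is enough to produce either symptom, depending on whether it hits the space tap or the shift press.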
As I have found no failure on a machine other than qa-power8-4, I assume that machine is at fault.
https://gitlab.suse.de/openqa/salt-pillars-openqa/merge_requests/118 prepared to exclude the machine until the investigation is done.
Tested on osd with sudo salt 'QA-Power8-4-kvm*' state.apply openqa.worker test=True
Changes:
----------
diff:
---
+++
@@ -6,7 +6,7 @@
HOST=openqa.suse.de
CACHEDIRECTORY=/var/lib/openqa/cache
QEMUTHREADS=1
-WORKER_CLASS=qemu_ppc64le_disabled_investigate_poo40670,qemu_ppc64le_no_tmpfs
+WORKER_CLASS=qemu_ppc64le,qemu_ppc64le_no_tmpfs
WORKER_HOSTNAME=10.162.6.201
[openqa.suse.de]
And then executed again live.
side-task: zluo created "investigation" jobs but left them in the production job group. To move them to the development job group I used the following commands
for i in $(openqa_client_osd --json-output jobs build=0393 arch=ppc64le groupid=139 test=check_bootloader | jq '.jobs | .[] | .id') ;do openqa_client_osd jobs/$i put --json-data '{"group_id": 132}' ; done
for i in $(openqa_client_osd --json-output jobs build=0393 arch=ppc64le groupid=139 test="check_bootloader@ppc64le" | jq '.jobs | .[] | .id') ;do openqa_client_osd jobs/$i put --json-data '{"group_id": 132}' ; done
Retriggered all tests that failed on SLE12SP4 in the corresponding same step. With this I feel we have reduced the urgency, so we can give the task back to the backlog and involve the tools team. I doubt zluo can easily drive the work here. I suggest investigating first by gathering better data, e.g. triggering a corresponding test scenario confined to that machine with WORKER_CLASS=qemu_ppc64le_disabled_investigate_poo40670
Updated by okurz about 6 years ago
- Subject changed from [sle][functional][u][fast] test fails in welcome - typos in kernel boot parameters to [sle][functional][u][fast][tools] test fails in welcome - typos in kernel boot parameters
- Status changed from In Progress to Workable
- Assignee deleted (zluo)
- Priority changed from Urgent to Normal
Updated by okurz about 6 years ago
I did not find anything related on the workers that differs, at least not in the parameters of the kernel modules "kvm" or "kvm_hv"
okurz@openqa:~> sudo salt 'QA-Power8*' cmd.run 'for i in kvm kvm_hv ; do modinfo $i ; done'
QA-Power8-4-kvm.qa.suse.de:
filename: /lib/modules/4.4.140-94.42-default/kernel/arch/powerpc/kvm/kvm.ko
license: GPL
author: Qumranet
srcversion: AB9D9B9EA7191D23D5C11C5
depends:
supported: yes
intree: Y
vermagic: 4.4.140-94.42-default SMP mod_unload modversions
signer: SUSE Linux Enterprise Secure Boot Signkey
sig_key: 3F:B0:77:B6:CE:BC:6F:F2:52:2E:1C:14:8C:57:C7:77:C7:88:E3:E7
sig_hashalgo: sha256
parm: halt_poll_ns:uint
parm: halt_poll_ns_grow:int
parm: halt_poll_ns_shrink:int
filename: /lib/modules/4.4.140-94.42-default/kernel/arch/powerpc/kvm/kvm-hv.ko
alias: devname:kvm
alias: char-major-10-232
license: GPL
srcversion: C42262D681C35A7BF4CEDCE
depends: kvm
supported: yes
intree: Y
vermagic: 4.4.140-94.42-default SMP mod_unload modversions
signer: SUSE Linux Enterprise Secure Boot Signkey
sig_key: 3F:B0:77:B6:CE:BC:6F:F2:52:2E:1C:14:8C:57:C7:77:C7:88:E3:E7
sig_hashalgo: sha256
parm: dynamic_mt_modes:Set of allowed dynamic micro-threading modes: 0 (= none), 2, 4, or 6 (= 2 or 4) (int)
parm: target_smt_mode:Target threads per core (0 = max) (int)
QA-Power8-5-kvm.qa.suse.de:
filename: /lib/modules/4.4.140-94.42-default/kernel/arch/powerpc/kvm/kvm.ko
license: GPL
author: Qumranet
srcversion: AB9D9B9EA7191D23D5C11C5
depends:
supported: yes
intree: Y
vermagic: 4.4.140-94.42-default SMP mod_unload modversions
signer: SUSE Linux Enterprise Secure Boot Signkey
sig_key: 3F:B0:77:B6:CE:BC:6F:F2:52:2E:1C:14:8C:57:C7:77:C7:88:E3:E7
sig_hashalgo: sha256
parm: halt_poll_ns:uint
parm: halt_poll_ns_grow:int
parm: halt_poll_ns_shrink:int
filename: /lib/modules/4.4.140-94.42-default/kernel/arch/powerpc/kvm/kvm-hv.ko
alias: devname:kvm
alias: char-major-10-232
license: GPL
srcversion: C42262D681C35A7BF4CEDCE
depends: kvm
supported: yes
intree: Y
vermagic: 4.4.140-94.42-default SMP mod_unload modversions
signer: SUSE Linux Enterprise Secure Boot Signkey
sig_key: 3F:B0:77:B6:CE:BC:6F:F2:52:2E:1C:14:8C:57:C7:77:C7:88:E3:E7
sig_hashalgo: sha256
parm: dynamic_mt_modes:Set of allowed dynamic micro-threading modes: 0 (= none), 2, 4, or 6 (= 2 or 4) (int)
parm: target_smt_mode:Target threads per core (0 = max) (int)
okurz@openqa:~> sudo salt 'QA-Power8*' cmd.run 'cat /sys/module/kvm*/parameters/*'
QA-Power8-4-kvm.qa.suse.de:
500000
2
0
6
0
QA-Power8-5-kvm.qa.suse.de:
500000
2
0
6
0
okurz@openqa:~> sudo salt 'malbec*' cmd.run 'cat /sys/module/kvm*/parameters/*'
malbec.arch.suse.de:
500000
2
0
6
0
Updated by riafarov about 6 years ago
- Has duplicate action #40412: [sle][functional][y] test fails in welcome - ESC triggered in GRUB added
Updated by oorlov about 6 years ago
- Status changed from Workable to In Progress
As per @okurz suggestion, I've executed 100 runs with "WORKER_CLASS=qemu_ppc64le_disabled_investigate_poo40670".
After they have finished, I'll investigate the pass/fail statistics.
Updated by oorlov about 6 years ago
- Status changed from In Progress to Workable
- Assignee deleted (oorlov)
Updated by SLindoMansilla about 6 years ago
- Due date changed from 2018-09-25 to 2018-10-09
Moving to sprint 27 since we were not able to resolve it during sprint 26
Updated by SLindoMansilla about 6 years ago
- Subject changed from [sle][functional][u][fast][tools] test fails in welcome - typos in kernel boot parameters to [sle][functional][u][medium][tools] test fails in welcome - typos in kernel boot parameters
- Description updated (diff)
- Priority changed from Normal to High
- Estimated time set to 5.00 h
- Difficulty set to medium
Let's re-estimate with the proper information after the investigation is finished.
Updated by jorauch about 6 years ago
- Status changed from Workable to In Progress
- Assignee set to jorauch
Taking a look, maybe we can already close this according to AC
Updated by jorauch about 6 years ago
Cross-checking with another worker is needed as final proof that this is a worker issue. With a failure rate of 85%, 10 runs should be enough. I still have to figure out which worker class to use for this.
'QA-Power8-4-kvm*' is the worker in question.
"qemu_ppc64le" is the class to use here, since the affected 8-4-kvm worker is still switched to the other class.
'GROUP="SLE15 Development" TEST=poo40670_crosscheck WORKER_CLASS=qemu_ppc64le'
Updated by jorauch about 6 years ago
- Status changed from In Progress to Feedback
And here we have the crosscheck runs:
https://openqa.suse.de/tests/overview?distri=sle&version=15-SP1&build=jrauch_validation&groupid=96
Updated by jorauch about 6 years ago
And all of them passed the bootloader fine; additionally, they ran on more than one worker (excluding 8-4-kvm).
With an initial failure rate of 85%, 10 tests should be proof that this is a worker-specific issue.
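The arithmetic behind "10 runs should be enough" can be sketched as follows, assuming independent runs with a fixed per-run failure probability (a simplification). The function computes the smallest number of runs needed to see at least one failure with a given confidence; its name and the confidence levels used are illustrative assumptions.

```python
import math


def runs_needed(failure_rate: float, confidence: float) -> int:
    """Smallest n such that 1 - (1 - failure_rate)**n >= confidence."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    # 85% failure rate, as observed in the first 100-run experiment.
    print(runs_needed(0.85, 0.999))
    # Roughly 6% failure rate, as reported for x86_64 in the description.
    print(runs_needed(0.06, 0.95))
```

With an 85% failure rate only a handful of runs are needed, so 10 passing runs are a strong signal. For a sporadic issue in the single-digit percent range, far more runs are required before passing results mean anything.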
Updated by jorauch about 6 years ago
- Status changed from Feedback to Resolved
Closing this and creating a follow up to fix the root cause of this
Updated by jorauch about 6 years ago
- Related to action #42032: [sle][functional][u][tools] QA-power-8-*-kvm have missing keys (ninjakeys) added
Updated by okurz about 6 years ago
- Related to action #42275: [functional][u][sporadic] test fails in bootloader with mistyped/incomplete regurl on x86_64 added
Updated by okurz about 6 years ago
jorauch wrote:
With an initial failure rate of 85%, 10 tests should be proof that this is a worker-specific issue.
I created another ticket for a similar issue on x86_64 which is #42275 so even if we find out that the ppc64le worker is special here we still have a problem on x86_64.
Updated by okurz about 6 years ago
- Due date deleted (2018-10-09)
- Status changed from Resolved to Workable
- Target version changed from Milestone 19 to Milestone 20
@jorauch want to stick?
Updated by jorauch about 6 years ago
- Assignee deleted (jorauch)
Actually no, I will free this up for someone with fresh ideas
Updated by SLindoMansilla about 6 years ago
- Subject changed from [sle][functional][u][medium][tools] test fails in welcome - typos in kernel boot parameters to [sle][functional][u][medium][tools][sporadic] test fails in welcome - typos in kernel boot parameters
- Description updated (diff)
- Status changed from Workable to In Progress
- Assignee set to SLindoMansilla
Investigating
Updated by SLindoMansilla about 6 years ago
- Related to deleted (action #42275: [functional][u][sporadic] test fails in bootloader with mistyped/incomplete regurl on x86_64)
Updated by SLindoMansilla about 6 years ago
- Has duplicate action #42275: [functional][u][sporadic] test fails in bootloader with mistyped/incomplete regurl on x86_64 added
Updated by SLindoMansilla about 6 years ago
- Description updated (diff)
- Priority changed from High to Urgent
Updated by SLindoMansilla about 6 years ago
- Description updated (diff)
The workers on OSD use openSUSE Leap 42.3, which has qemu-2.9.1 (See https://build.opensuse.org/package/view_file/openSUSE:Leap:42.3:Update/qemu/qemu.spec?expand=1).
- To have qemu-2.11.2 it is necessary to upgrade those machines to openSUSE Leap 15.0 (see https://build.opensuse.org/package/view_file/openSUSE:Leap:15.0:Update/qemu/qemu.spec?expand=1)
- or to submit qemu-2.11.2 to openSUSE Leap 42.3
Updated by okurz about 6 years ago
I suggest still trying even slower typing, not via the test variable but explicitly in the bootloader screen. I can see in https://openqa.suse.de/tests/2164495/modules/bootloader/steps/1/src that we already pass an explicit parameter in the line registration_bootloader_params(utils::VERY_SLOW_TYPING_SPEED); so we can schedule tests even on production but override that individual test module with one that types even slower, e.g. registration_bootloader_params(2);
This can be scheduled using the parameter "SCHEDULE" of os-autoinst and supplying the patched test module as a downloadable asset. You can put the patched module on a web server, e.g. your export folder, then download it into the job and schedule it, e.g. openqa-clone-job … ASSET_1_URL=https://w3.nue.suse.com/~slindomansilla/bootloader_poo40670.pm SCHEDULE=bootloader_poo40670,tests/installation/welcome
Updated by SLindoMansilla about 6 years ago
- Description updated (diff)
Bug against Leap 42.3 created: https://bugzilla.opensuse.org/show_bug.cgi?id=1112305
Updated by SLindoMansilla about 6 years ago
okurz suggested to create a constant with a lower value.
PR here: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5998
Updated by SLindoMansilla about 6 years ago
- Description updated (diff)
New round with slower typing: https://openqa.suse.de/tests/overview?build=misstyped-linuxrc-poo40670-B
Already failed on first job.
Updated by okurz about 6 years ago
Ok, so even slower typing does not help. You can revert your PR then I guess.
Could you please crosscheck with an older released product, e.g. SLE15-GM-668.1, to rule out that product changes caused this?
Updated by SLindoMansilla about 6 years ago
- Description updated (diff)
I had to copy the iso to OSD, waiting for verification: https://openqa.suse.de/tests/overview?build=misstyped-linuxrc-poo40670-C
Updated by SLindoMansilla about 6 years ago
- Description updated (diff)
- Status changed from In Progress to Workable
Product changes are also responsible for the issue.
No idea how to continue from here.
Updated by SLindoMansilla about 6 years ago
PR to revert changes: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/6004
Updated by SLindoMansilla about 6 years ago
- Status changed from Workable to In Progress
Try info parameter (https://en.opensuse.org/SDB:Linuxrc#Parameter_Reference):
- info=http://$WORKER_IP/files/info
Updated by SLindoMansilla about 6 years ago
Updated by riafarov about 6 years ago
Same issue here: https://openqa.suse.de/tests/2199334# , wrong url to the profile is typed.
Updated by okurz about 6 years ago
- Status changed from In Progress to Workable
Ready to be looked into again by slindomansilla, maybe next week
Updated by SLindoMansilla about 6 years ago
- Status changed from Workable to In Progress
Trying to use the linuxrc parameter info to specify parameters in a configuration file.
Updated by SLindoMansilla about 6 years ago
The file format for the info file is very tricky, and I haven't got it working through https (but probably we don't need https for that).
A working sample file can be seen here: https://w3.nue.suse.com/~slindomansilla/tmp/linuxrc_parameters.info
Updated by okurz about 6 years ago
@SLindoMansilla In #40670#note-7 we mentioned that ppc64le is basically unusable due to this. This does not seem to be the case anymore. I have the feeling that we now see the originally observed problem only rarely, but I do not know the reason. https://w3.nue.suse.com/~okurz/openqa_suse_de_status.html mentions poo#40670 a single time. I don't know what changed, but I guess it is not urgent anymore; we should reduce the priority to high and re-discuss which approach to follow. How would you assess the current situation?
Updated by okurz about 6 years ago
And in https://openqa.suse.de/tests/2213493/file/autoinst-log.txt I see one problem. The regurl is again typed not with "very slow" but only with "slow" speed, which could explain the problems.
I will handle that special situation -> https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/6094
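Checking which typing speed was actually used can be automated by scanning autoinst-log.txt for type_string calls. The sketch below relies only on the log line format quoted in this ticket; the function names are hypothetical, and the reading that a lower max_interval means slower typing is inferred from the ticket (2 was the "slower" setting compared to the logged 4), not taken from os-autoinst documentation.

```python
import re

# Matches the debug line format quoted in this ticket, e.g.
# <<< testapi::type_string(string=' Y2DEBUG=1 ', max_interval=4, ...)
TYPE_STRING = re.compile(
    r"testapi::type_string\(string='(?P<text>.*)', max_interval=(?P<interval>\d+)"
)


def typing_calls(log_lines):
    """Return (typed_text, max_interval) for every type_string call in the log."""
    calls = []
    for line in log_lines:
        match = TYPE_STRING.search(line)
        if match:
            calls.append((match.group("text"), int(match.group("interval"))))
    return calls


def typed_faster_than(calls, limit):
    """Return the calls whose max_interval is above `limit` (assumed: higher is faster)."""
    return [call for call in calls if call[1] > limit]


if __name__ == "__main__":
    sample = [
        "[2018-10-10T10:24:10.0807 CEST] [debug] <<< testapi::type_string("
        "string=' regurl=http://Server-0421.proxy.scc.suse.de', max_interval=4, "
        "wait_screen_changes=0, wait_still_screen=0)",
    ]
    for text, interval in typing_calls(sample):
        print(f"max_interval={interval}: {text!r}")
```

Filtering the result with typed_faster_than and the expected limit would flag boot parameters typed faster than intended.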
Updated by nicksinger about 6 years ago
The info-approach should improve the situation significantly (since it would avoid typing long URLs and such) but doesn't avoid typing inside the bootloader completely.
Definitely worth a try. If it doesn't help, here is another idea:
We could have a specific bootloader per worker (e.g. ipxe) which always chainloads its configuration. This configuration could be written by openqa beforehand. This way we could eliminate all typing in the bootloader completely. However, it would involve creating the script for the bootloader and teaching openQA to write these configurations. So it would be a bigger rewrite of the backend there.
Updated by okurz about 6 years ago
Again we seem to be back to a single specific problem: https://w3.nue.suse.com/~okurz/openqa_suse_de_status.html mentions this ticket in only two jobs, https://openqa.suse.de/tests/2228503 and https://openqa.suse.de/tests/2228489, both in SLE15SP1 and both showing the same problem: " Y2DEBUG=1" is typed without the leading space. This is not even the graphical grub screen. I assume that what happens here is that the previous key sequence with the key "end" is sent and then we immediately type the next character without waiting long enough to actually reach the end of the line. Can we handle just that special case for now, please?
Updated by SLindoMansilla about 6 years ago
Updated by SLindoMansilla about 6 years ago
PR to have more shared workers (PPC): https://gitlab.suse.de/qa-sle/shared-development-worker/merge_requests/4
Updated by SLindoMansilla about 6 years ago
- Status changed from In Progress to Feedback
Statistical investigation: http://slindomansilla-vm.qa.suse.de/tests/overview?build=poo40670_ppc_bootloader_mystyping
Updated by SLindoMansilla about 6 years ago
The investigation doesn't show the issue of missing white space before Y2DEBUG.
PR merged.
Preparing statistical investigation on OSD.
Updated by SLindoMansilla about 6 years ago
Waiting for verification run: https://openqa.suse.de/tests/overview?build=poo40670_mistyped_y2debug
Updated by okurz about 6 years ago
- Copied to action #43655: [qe-core][functional] Increase robustness of using bootloader parameter with info-file instead of long typing added
Updated by okurz about 6 years ago
0/101 passed, I think we have this particular issue covered. I created #43655 to cover the info-file approach. I guess we can close this ticket here as we do not seem to see any current, reproducible bootloader typing parameter issues anymore.
Updated by SLindoMansilla about 6 years ago
- Status changed from Feedback to Resolved
Well, I think you mean 0/101 failed ;)
Agree. Urgency removed. Approach to improve boot parameters typing will follow on: #43655
Updated by okurz about 6 years ago
yes, this is what I meant, this is what I thought. What I wrote is stupid
Updated by okurz about 6 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: RAID5
https://openqa.suse.de/tests/2244625
Updated by okurz about 1 year ago
- Related to action #139271: Repurpose PowerPC hardware in FC Basement - mania Power8 PowerPC size:M added