[sle][functional][migration][opensuse][virtualization] Increase/disable timeout of initial grub menu to ensure tests do not miss it
Many openQA tests, across all scenarios and regardless of worker or machine type, sometimes skip over the grub menu and then fail in a fully booted desktop while still looking for "grub2". Sometimes the menu shows up in the video for a glimpse, sometimes not.
Example: opensuse-5.10.90-Krypton-Live-x86_64-krypton-live-installation@64bit-2G fails in grub_test
All kinds of scenarios, but not reproducible every time
After the machine reboots the grub screen should be caught within the 8 second default timeout: https://openqa.opensuse.org/tests/494564#step/grub_test/1
It seems we cannot ensure a reliable testing environment where the full screen content shows up in the early phases of boot. Either this is a problem we always had but which was not important enough to handle because there were fewer test scenarios, or changes in os-autoinst introduced a regression.
In any case a fix to prevent this would be to configure the bootloader in the installer to use a much higher timeout, e.g. 60 seconds, or disable it fully.
- take a look at how tests/installation/disable_grub_graphics.pm enters the bootloader configuration menu from the installer and disables the grub graphics (time estimation: 0.1 - 0.5h)
- use the same approach in a new test module with an explicit name, e.g. 'change_grub_timeout.pm', to bump the timeout to 60 seconds (or a similarly high value) or disable the grub timeout (time estimation: 0.5 - 2h)
- make sure this test module is used in all relevant tests, e.g. all openSUSE+SLE installation tests, including the live cd (time estimation: 0.5 - 4h)
- changes to "first_boot" or the wait_boot function should not be necessary because the bootloader menu should then still show up and is searched for within the normal timeout; by bumping the grub timeout itself we ensure the bootloader menu shows up long enough to be caught
Always latest result in this scenario: latest
#13 Updated by JERiveraMoya over 5 years ago
Needles are also needed for Tumbleweed. Still pending: sending another PR to my fork and checking with Sergio whether these two above in my forks can be improved before sending them to the official repo.
These two should cover the text-mode and graphical scenarios for Tumbleweed. Successfully tested four scenarios locally.
#14 Updated by okurz over 5 years ago
Feel free to provide a PR and mark the subject line as "[WIP]" to prevent it from being merged while inviting early feedback. Until then I will take a look at your github commit and comment there.
EDIT: hm, I thought I could comment on a commit as well but I can't. Please provide a PR then
#16 Updated by JERiveraMoya over 5 years ago
GRUB_TIMEOUT=-1 will cause the menu to be displayed until a boot entry is selected manually.
Renamed tests and needles from change_grub_timeout to disable_grub_timeout.
This change does not make needling easier, given that it just types -1 instead of 60 seconds. The added needles are only needed to verify the path to the field where the value is typed (i.e. correct tab selected, etc.)
Found missing needles in Tumbleweed text mode. Currently fixing it before sending [WIP] PR.
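For reference, outside the installer the same setting lives in /etc/default/grub. Below is a minimal sketch of what the value change amounts to, assuming the usual KEY=value layout of that file; the helper name set_grub_timeout is hypothetical and is not part of the test module, which drives the YaST bootloader dialog instead.

```python
import re

# Hypothetical helper, for illustration only: rewrites the GRUB_TIMEOUT
# entry in the text of an /etc/default/grub style file.
def set_grub_timeout(config_text: str, timeout: int = -1) -> str:
    """Return config_text with GRUB_TIMEOUT set to `timeout`.

    A value of -1 makes GRUB show the menu until an entry is selected.
    """
    line = f"GRUB_TIMEOUT={timeout}"
    # Match only the exact key, not GRUB_TIMEOUT_STYLE or GRUB_HIDDEN_TIMEOUT.
    pattern = re.compile(r"^[ \t]*GRUB_TIMEOUT=.*$", re.MULTILINE)
    if pattern.search(config_text):
        return pattern.sub(line, config_text)
    # No existing entry: append one, keeping the file newline-terminated.
    sep = "" if (not config_text or config_text.endswith("\n")) else "\n"
    return f"{config_text}{sep}{line}\n"


sample = "GRUB_DEFAULT=saved\nGRUB_TIMEOUT=8\nGRUB_TIMEOUT_STYLE=menu\n"
print(set_grub_timeout(sample))
```

On an installed system the change would only take effect after regenerating grub.cfg (e.g. with grub2-mkconfig), which the installer does on its own when the value is set through the bootloader settings.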
#20 Updated by JERiveraMoya over 5 years ago
Tested for Leap 42.3 Updates and Maintenance. New needles were created for openSUSE Leap. There are some subtle differences between Tumbleweed and Leap (i.e. the lines that draw the menu tabs have different lengths, the square around selected text is bigger in one than in the other, etc.)
#25 Updated by okurz over 5 years ago
jrivera: looks like leap 15.0 uses alt-l? https://openqa.opensuse.org/tests/506564#step/disable_grub_timeout/6
#26 Updated by JERiveraMoya over 5 years ago
Fixed several issues for other products:
- Test included in more test scenarios.
- SLE 12 GA: Fixed a problem related to the accepted range for Timeout, [0,300]. For this product version 60 seconds is typed instead of -1.
- SLE 12 SP1: missing needles.
- openSUSE Leap: different shortcut to access the tab & corresponding needles.
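The SLE 12 GA exception boils down to a small decision on which value to type. A rough sketch, assuming the product is identified by a distri/version pair where SLE 12 GA is reported as version "12"; the function name and the exact identifiers are assumptions for illustration, not taken from the test module.

```python
# Hypothetical sketch of the per-product value choice described above.
SLE12_GA_TIMEOUT_RANGE = (0, 300)  # range accepted by the Timeout field on SLE 12 GA


def grub_timeout_to_type(distri: str, version: str) -> str:
    """Return the string to type into the bootloader Timeout field."""
    if distri == "sle" and version == "12":
        # -1 is outside [0, 300], so fall back to a long but finite timeout.
        return "60"
    # Everywhere else -1 disables the timeout entirely.
    return "-1"


for product in [("sle", "12"), ("sle", "12-SP1"), ("opensuse", "Tumbleweed")]:
    print(product, "->", grub_timeout_to_type(*product))
```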
#27 Updated by AndreasStieger over 5 years ago
But this seems to fail on Leap Maintenance?
disable_grub_timeout is failing in Leap Maintenance tests:
#28 Updated by okurz over 5 years ago
Looks like the wrong hotkey for Leap 42.3. https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3742 should be able to handle this. Verified.
#32 Updated by JERiveraMoya over 5 years ago
New PR for fixing most of the found failures (hopefully): https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3749
Found today that sle-15 for aarch64 is 'alt + t': https://openqa.suse.de/tests/1224390#step/disable_grub_timeout/3
#33 Updated by JERiveraMoya over 5 years ago
The last PR seems to work except for some needles in SLE-15 that are in a different position on the screen; we might need another wait_still_screen(1) before sending the key for selecting the tab. However, the exception for Leap 42.2 & 42.3 on machine 64bit-2G is not implemented yet, so a lot of tests are failing, as expected. Added PR for workaround:
#34 Updated by SLindoMansilla over 5 years ago
PR to fix wrong hotkey on x86_64: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3760
Verified on OSD: https://openqa.suse.de/tests/1227894#step/disable_grub_timeout/4
#35 Updated by SLindoMansilla over 5 years ago
- H1 (REJECTED by E1-1): Machines/workers with 2G are causing the problem of a different hotkey in module disable_grub_timeout.
- H2: Jobs without UEFI on Leap versions older than Leap 15 are causing the problem of a different hotkey in module disable_grub_timeout.
- H3: Jobs without UEFI on SLE versions older than 12-SP2 are also affected by the problem of a different hotkey in module disable_grub_timeout.
E1-1: Find jobs with 2G machines where the module disable_grub_timeout works.
R1-1: Works fine for the same build, version and arch for jobs with UEFI:
E2-1: Find all jobs with 2G machines that do not have UEFI set and do not fail on the module disable_grub_timeout.
R2-1: Not found.
E2-2: Find all jobs with 2G machines that do not have UEFI set and fail on the module disable_grub_timeout.
R2-2: All jobs that execute the module disable_grub_timeout and do not have UEFI fail:
E3-1: Find jobs with UEFI for SLE 12-SP1 that fail on module disable_grub_timeout.
#36 Updated by SLindoMansilla over 5 years ago
PR to fix non-UEFI jobs for Leap 42.2: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3764
Superseded by https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3762
#37 Updated by okurz over 5 years ago
So the failures I found which I think are most critical are in the part you mentioned as "not implemented yet" for the openSUSE Leap maintenance tests. This is something that should also be done on the weekend; maintenance updates never sleep, especially not on openSUSE ;-)
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3762 is a two-fold approach: it adds an explicit assert_screen to make sure the bootloader settings are shown before trying to press any key. Also, it accepts that the key to switch the tab might not be found and then returns without failing.
#38 Updated by okurz over 5 years ago
Tests fixed for the time being with https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3765 by cancelling the configuration module when the hotkey did not match. I think a safer solution is to use send_key_until_needle_match 'installation-bootloader-options', 'tab' which should work in both ncurses and GUI.
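The idea behind the "press tab until the needle matches" suggestion is independent of any hotkey. A small simulation of that loop, not the os-autoinst implementation: FakeDialog, press_until_match and the widget names are made up for illustration, and a real needle match is replaced by a simple focus check.

```python
from typing import Callable


def press_until_match(send_key: Callable[[str], None],
                      screen_matches: Callable[[], bool],
                      key: str = "tab",
                      max_presses: int = 20) -> bool:
    """Press `key` until `screen_matches()` is true or the budget runs out."""
    if screen_matches():
        return True
    for _ in range(max_presses):
        send_key(key)
        if screen_matches():
            return True
    return False


class FakeDialog:
    """Stand-in for an installer screen: 'tab' cycles focus through widgets."""

    def __init__(self, widgets, target):
        self.widgets = widgets
        self.target = target
        self.focus = 0
        self.presses = 0

    def send_key(self, key: str) -> None:
        self.presses += 1
        if key == "tab":
            self.focus = (self.focus + 1) % len(self.widgets)

    def matches(self) -> bool:
        return self.widgets[self.focus] == self.target


dialog = FakeDialog(["release-notes", "boot-code-options", "kernel-parameters",
                     "bootloader-options"], target="bootloader-options")
found = press_until_match(dialog.send_key, dialog.matches)
print(found, dialog.presses)
```

Because the loop never relies on a specific alt shortcut, an extra button that steals the hotkey only changes how many presses are needed, not whether the tab is reached.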
#39 Updated by okurz over 5 years ago
- Assignee changed from JERiveraMoya to okurz
SLindoMansilla, JERiveraMoya : There is one hypothesis which you have not mentioned. It seems that there actually is a differing button. I wonder if you have not seen this. But in the needles I created you can see this. Compare https://github.com/os-autoinst/os-autoinst-needles-opensuse/pull/282/files#diff-0a98fc599774cf73e5f165d6483189a5 and https://github.com/os-autoinst/os-autoinst-needles-opensuse/pull/282/files#diff-295cbb639bb910fede484a4855037ee2 . The hotkey on "bootloader options" changes because in the second screen there is an additional button in top-right about "release notes" taking the "alt-l".
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3762 and the corresponding needle PRs cover this now.
#40 Updated by JERiveraMoya over 5 years ago
Investigating why only 0 is typed instead of -1 (this issue makes grub_test fail). Not reproducible on my local machine. The logs show that -1 is typed. It seems to be related to the typing speed.
The textbox only detects "-" and after "ret" it falls back to "0".
It might help to use type_string_slow. PR: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3777
#41 Updated by okurz over 5 years ago
PR merged and active on both osd and o3. Now we sometimes seem to have the problem with the '0', so it seems we are not done yet. I will also try to run it more often locally.
type_string_slow would only wait between the characters and it's just two characters. Happened again in https://openqa.opensuse.org/tests/511736#step/disable_grub_timeout/8
I will try to reproduce locally as well. https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3778 as a temporary workaround to disable again.
#43 Updated by okurz over 5 years ago
Could not reproduce the problem with "0 typing" locally in 40 runs, so my best guess remains that the production environment is different here and something like a wait_still_screen(1) could help for production. This is included in my PR. I leave it to reviewers to decide about that.
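To make the guess concrete: waiting for a still screen means polling the display until it has not changed for a given time, and only then typing. The following is a simulation of that idea under stated assumptions, not the os-autoinst code; wait_until_still, grab_frame and FakeClock are hypothetical names, and frames are plain values compared for equality.

```python
import itertools
import time
from typing import Any, Callable


def wait_until_still(grab_frame: Callable[[], Any],
                     stilltime: float = 1.0,
                     timeout: float = 30.0,
                     interval: float = 0.1,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True once the frame stayed unchanged for `stilltime` seconds."""
    deadline = clock() + timeout
    last = grab_frame()
    still_since = clock()
    while clock() < deadline:
        if clock() - still_since >= stilltime:
            return True
        sleep(interval)
        frame = grab_frame()
        if frame != last:
            # Screen changed: restart the stillness window.
            last = frame
            still_since = clock()
    return False


class FakeClock:
    """Deterministic clock so the example does not really sleep."""

    def __init__(self):
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


clock = FakeClock()
frames = itertools.chain(["a", "b", "c"], itertools.repeat("c"))
settled = wait_until_still(lambda: next(frames), stilltime=1.0, timeout=10.0,
                           interval=0.5, clock=clock.time, sleep=clock.sleep)
print(settled, clock.now)
```

In a slower production environment the dialog may still be redrawing when the first character arrives, which would be consistent with only part of "-1" being accepted.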
#44 Updated by JERiveraMoya over 5 years ago
Missing needles for SLE-12 GA: https://gitlab.suse.de/openqa/os-autoinst-needles-sles/merge_requests/544
#45 Updated by okurz over 5 years ago
- Status changed from In Progress to Resolved
- Assignee changed from okurz to JERiveraMoya
I consider ourselves done here. The last change I did was add an additional wait_still_screen(1) to fix a sporadic problem we could never see locally. https://openqa.suse.de/tests/1231772 is the verification on SLE15, https://openqa.suse.de/tests/1232728#step/disable_grub_timeout/14 is one on SLE 12 SP2 maintenance tests. https://openqa.opensuse.org/tests/512274 is an openSUSE staging test. I am currently monitoring maintenance tests for any missing parts, e.g. missing needles or jobs that still need to be retriggered, which I consider out of scope of this ticket, so setting to resolved. Also setting assignee back to JERiveraMoya who did the main work here.