action #25658

[sle][functional][migration][opensuse][virtualization] Increase/disable timeout of initial grub menu to ensure tests do not miss it

Added by okurz over 5 years ago. Updated over 5 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Bugs in existing tests
Start date: 2017-09-29
Due date: 2017-10-25
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

Many openQA tests, across all scenarios and regardless of worker or machine type, sometimes skip over the grub menu and then fail on a fully booted desktop while still looking for "grub2". Sometimes the menu is visible in the video for a brief moment, sometimes not at all.

Example: opensuse-5.10.90-Krypton-Live-x86_64-krypton-live-installation@64bit-2G fails in grub_test

Reproducible

All kinds of scenarios, but not reproducible every time

Expected result

After the machine reboots, the grub screen should be caught within the 8-second default timeout: https://openqa.opensuse.org/tests/494564#step/grub_test/1

Problem

It seems we cannot ensure a reliable testing environment in which the full screen content shows up in the early phases of boot. Either this is a problem we always had but that was not important enough to handle because there were fewer test scenarios, or changes in os-autoinst introduced a regression.

Suggestion

In either case, a fix would be to configure the bootloader in the installer to use a much higher timeout, e.g. 60 seconds, or to disable the timeout entirely.

  • take a look at how tests/installation/disable_grub_graphics.pm enters the bootloader configuration menu from the installer and disables the grub graphics (time estimation: 0.1 - 0.5h)
  • use the same approach in a new test module with an explicit name, e.g. 'change_grub_timeout.pm', to bump the timeout to 60 seconds (or a similarly high value) or to disable the grub timeout (time estimation: 0.5 - 2h)
  • make sure this test module is used in all relevant tests, e.g. all openSUSE+SLE installation tests, including the live cd (time estimation: 0.5 - 4h)
  • changes to "first_boot" or the wait_boot function should not be necessary: the bootloader menu is still searched for within the normal timeout, and bumping the grub timeout itself should ensure that the menu stays on screen long enough to be caught

Further details

Always latest result in this scenario: latest


Related issues

Related to openQA Tests - action #26116: [sle][functional] test fails in disable_grub_timeout - test flow seems wrong Resolved 2017-10-17 2017-11-08

Related to openQA Tests - action #26868: [sle][functional][hyperv]test fails in disable_grub_timeout, 'alt-t' was pressed when it should have been 'alt-r' Resolved 2017-10-19 2017-11-08

Related to openQA Tests - action #26898: [sle][functional][opensuse]boot_to_snapshot: test fails in first_boot Resolved 2017-10-19

Related to openQA Tests - action #26936: disable_grub_timeout test broken Resolved 2017-10-22 2017-10-23

Related to openQA Tests - action #26978: autoyast/gnome installation stuck on grub Resolved 2017-10-24 2017-11-08

Blocks openQA Tests - action #12894: [sles][functional][ppc64] boot_to_snapshot sporadic fails in grub_test Resolved 2016-07-27

Blocks openQA Tests - action #17502: [sle][functional]test fails in grub_test on timeout waiting for inst-bootmenu Resolved 2017-03-03 2017-11-08

Blocks openQA Tests - action #18304: [sles][migration] test fails in grub_test_snapshot: Can't find desired snapshot in grub menu Rejected 2017-04-03

Blocks openQA Tests - action #25388: [sle][functional][sle15][aarch64]test fails in grub_test Resolved 2017-09-18 2017-11-08

Blocks openQA Tests - action #25376: [sle][functional][ppc64le][hard] test fails in grub_test - stuck in grub screen, expected encrypted password prompt (was: boots completely) Resolved 2017-09-18 2017-12-06

Blocks openQA Tests - action #25664: [opensuse][qam]test fails to see grub menue in time in console_reboot even though it's visible but probably just a little bit too late Closed 2017-09-29

History

#1 Updated by okurz over 5 years ago

  • Related to action #12894: [sles][functional][ppc64] boot_to_snapshot sporadic fails in grub_test added

#2 Updated by okurz over 5 years ago

  • Blocked by action #17502: [sle][functional]test fails in grub_test on timeout waiting for inst-bootmenu added

#3 Updated by okurz over 5 years ago

  • Related to deleted (action #12894: [sles][functional][ppc64] boot_to_snapshot sporadic fails in grub_test)

#4 Updated by okurz over 5 years ago

  • Blocks action #12894: [sles][functional][ppc64] boot_to_snapshot sporadic fails in grub_test added

#5 Updated by okurz over 5 years ago

  • Blocked by deleted (action #17502: [sle][functional]test fails in grub_test on timeout waiting for inst-bootmenu )

#6 Updated by okurz over 5 years ago

  • Blocks action #17502: [sle][functional]test fails in grub_test on timeout waiting for inst-bootmenu added

#7 Updated by okurz over 5 years ago

  • Blocks action #18304: [sles][migration] test fails in grub_test_snapshot: Can't find desired snapshot in grub menu added

#8 Updated by okurz over 5 years ago

  • Blocks action #25388: [sle][functional][sle15][aarch64]test fails in grub_test added

#9 Updated by okurz over 5 years ago

  • Blocks action #25376: [sle][functional][ppc64le][hard] test fails in grub_test - stuck in grub screen, expected encrypted password prompt (was: boots completely) added

#10 Updated by okurz over 5 years ago

  • Due date set to 2017-10-25
  • Target version set to Milestone 11

#11 Updated by okurz over 5 years ago

  • Blocks action #25664: [opensuse][qam]test fails to see grub menue in time in console_reboot even though it's visible but probably just a little bit too late added

#12 Updated by okurz over 5 years ago

  • Status changed from New to In Progress
  • Assignee set to JERiveraMoya

JERiveraMoya partnering with SLindoMansilla

#13 Updated by JERiveraMoya over 5 years ago

https://github.com/jknphy/os-autoinst-distri-opensuse/commit/780b957b90ad51d190e2949f22fe985d61ed9630
https://gitlab.suse.de/JERiveraMoya/os-autoinst-needles-sles/commit/82dad959a13b06477284a1a25e225c8425f65080
Needles for Tumbleweed are also needed. I still have to send another PR to my fork and check with Sergio whether the two commits above in my forks can be improved before sending them to the official repo.
These two should cover the text-mode and graphical scenarios for Tumbleweed. Successfully tested locally in four scenarios.

#14 Updated by okurz over 5 years ago

Feel free to provide a PR and mark the subject line as "[WIP]" to prevent it from being merged while still inviting early feedback. Until then I will take a look at your GitHub commit and comment there.

EDIT: Hm, I thought I could comment on a commit as well, but I can't. Please provide a PR then.

#15 Updated by okurz over 5 years ago

To make needling easier we should fully disable the timeout. Can you do that?

#16 Updated by JERiveraMoya over 5 years ago

GRUB_TIMEOUT=-1 causes the menu to be displayed until a boot entry is selected manually.
Renamed tests and needles from change_grub_timeout to disable_grub_timeout.
This change does not make needling easier, since the only difference is typing -1 instead of 60 seconds. The added needles are only needed to verify the path to the field where the value is typed (e.g. that the correct tab is selected).
Found missing needles in Tumbleweed text mode. Currently fixing that before sending the [WIP] PR.
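
For readers unfamiliar with the setting, here is a minimal sketch of what GRUB_TIMEOUT=-1 looks like at the file level. The test module in this ticket changes the value through the installer's bootloader dialog, not by editing files. On an installed system the same setting lives in /etc/default/grub, and grub.cfg has to be regenerated afterwards (for example with grub2-mkconfig). The function name set_grub_timeout is hypothetical and only operates on a string, so it is safe to run anywhere.

```python
import re

# Matches a GRUB_TIMEOUT= assignment on its own line, but not
# related keys such as GRUB_TIMEOUT_STYLE=.
_TIMEOUT_LINE = re.compile(r"^[ \t]*GRUB_TIMEOUT=.*$", re.MULTILINE)


def set_grub_timeout(config_text: str, timeout: int) -> str:
    """Return config_text with GRUB_TIMEOUT set to `timeout`.

    A value of -1 tells GRUB to show the menu until an entry is
    selected manually. Hypothetical helper for illustration only.
    """
    new_line = f"GRUB_TIMEOUT={timeout}"
    if _TIMEOUT_LINE.search(config_text):
        # Replace every existing assignment in place.
        return _TIMEOUT_LINE.sub(new_line, config_text)
    # No assignment yet: append one, adding a newline separator if needed.
    separator = "" if (not config_text or config_text.endswith("\n")) else "\n"
    return f"{config_text}{separator}{new_line}\n"
```

Calling set_grub_timeout(text, -1) on the contents of a grub defaults file leaves all other lines untouched and only rewrites (or appends) the GRUB_TIMEOUT line.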

#17 Updated by okurz over 5 years ago

Great news, so a good change. Feel free to create a PR anyway once it works for just one scenario. We will comment on the PR, and I am sure there is at least something to change that will have an impact. Better to get fast feedback.

#20 Updated by JERiveraMoya over 5 years ago

Tested for Leap 42.3 Updates and Maintenance. New needles were created for openSUSE Leap. There are some subtle differences between Tumbleweed and Leap (e.g. the lines drawn around the menu tabs have different lengths, the box around selected text is bigger in one than in the other, etc.)

#21 Updated by JERiveraMoya over 5 years ago

Added several improvements to the PRs based on review comments: syntax, better use of openQA libraries, visualization of the change with a screenshot, etc.

#22 Updated by osukup over 5 years ago

And it breaks the whole QAM workflow ...

#23 Updated by JERiveraMoya over 5 years ago

As discussed with osukup, I am looking into re-needling in the "Maintenance: Test repo" job group.

#24 Updated by okurz over 5 years ago

  • Related to action #26116: [sle][functional] test fails in disable_grub_timeout - test flow seems wrong added

#25 Updated by okurz over 5 years ago

#26 Updated by JERiveraMoya over 5 years ago

Fixed several issues for other products:

  • Test included in more test scenarios.
  • SLE 12 GA: Fixed a problem with the accepted range for the timeout, [0,300]. For this product version, 60 seconds is typed instead of -1.
  • SLE 12 SP1: missing needles.
  • openSUSE Leap: different shortcut to access the tab, plus corresponding needles.
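
The SLE 12 GA fallback described above can be sketched as a small decision function. This is an illustration in Python, not the code of the Perl test module: the name timeout_to_type and the accepted_range parameter are hypothetical, and only the values -1, 60 and the range [0,300] come from this ticket.

```python
from typing import Optional, Tuple

DISABLE_TIMEOUT = -1  # GRUB: wait until an entry is selected manually


def timeout_to_type(accepted_range: Optional[Tuple[int, int]],
                    fallback: int = 60) -> int:
    """Pick the value to type into the installer's timeout field.

    accepted_range is the (min, max) the field accepts, or None if the
    field is assumed to accept -1. Hypothetical helper for illustration.
    """
    if accepted_range is None:
        return DISABLE_TIMEOUT
    low, high = accepted_range
    if low <= DISABLE_TIMEOUT <= high:
        return DISABLE_TIMEOUT
    if not low <= fallback <= high:
        raise ValueError(f"fallback {fallback} outside accepted range {accepted_range}")
    # The field rejects -1, so use a long but finite timeout instead.
    return fallback
```

With the range reported for SLE 12 GA, timeout_to_type((0, 300)) yields 60, while products without that restriction get -1.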

#28 Updated by okurz over 5 years ago

Looks like the wrong hotkey for Leap 42.3. https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3742 should be able to handle this. Verified.

#29 Updated by okurz over 5 years ago

-> #26868

#30 Updated by okurz over 5 years ago

  • Related to action #26868: [sle][functional][hyperv]test fails in disable_grub_timeout, 'alt-t' was pressed when it should have been 'alt-r' added

#31 Updated by okurz over 5 years ago

  • Related to action #26898: [sle][functional][opensuse]boot_to_snapshot: test fails in first_boot added

#32 Updated by JERiveraMoya over 5 years ago

New PR to fix most of the failures found (hopefully): https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3749
Found today that the hotkey on sle-15 for aarch64 is 'alt + t': https://openqa.suse.de/tests/1224390#step/disable_grub_timeout/3

#33 Updated by JERiveraMoya over 5 years ago

The last PR seems to work, except for some needles in SLE-15 that are in a different position on the screen; we might need another wait_still_screen(1) before sending the key that selects the tab. BUT for Leap 42.2 & 42.3 on machine 64bit-2G the exception is not implemented yet, so it makes a lot of tests fail, as expected. Added a PR as a workaround:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3755

#35 Updated by SLindoMansilla over 5 years ago

  • H1 (REJECTED by E1-1): Machines/workers with 2G are causing the different-hotkey problem in module disable_grub_timeout.
  • H2: Jobs without UEFI on Leap versions older than Leap 15 are causing the different-hotkey problem in module disable_grub_timeout.
  • H3: Jobs without UEFI on SLE versions older than 12-SP2 are also affected by the different-hotkey problem in module disable_grub_timeout.

  • E1-1: Find jobs with 2G machines where the module disable_grub_timeout works.

  • R1-1: Works fine for the same build, version and arch for jobs with UEFI:

  • E2-1: Find all jobs with 2G machines that do not have UEFI set and do not fail in the module disable_grub_timeout.

  • R2-1: Not found.

  • E2-2: Find all jobs with 2G machines that do not have UEFI set and fail in the module disable_grub_timeout.

  • R2-2: All jobs that execute the module disable_grub_timeout and do not have UEFI fail:

  • E3-1: Find jobs with UEFI for SLE 12-SP1 that fail in module disable_grub_timeout.

#37 Updated by okurz over 5 years ago

The failures I found that I consider most critical concern the part you mentioned as "not implemented yet" for openSUSE Leap maintenance tests. This should also be done over the weekend; maintenance updates never sleep, especially not on openSUSE ;-)

https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3762 is a two-fold approach: it adds an explicit assert_screen to make sure the bootloader settings are shown before trying to press any key. It also accepts the case where the key to switch the tab cannot be found and returns without failing.

#38 Updated by okurz over 5 years ago

Tests are fixed for the time being with https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3765, which cancels the configuration module when the hotkey does not match. I think a safer solution is to use send_key_until_needle_match 'installation-bootloader-options', 'tab' which should work for both ncurses and the GUI.
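
The idea behind pressing 'tab' until the target needle matches, instead of relying on a product-specific hotkey, can be sketched as a bounded retry loop. This is a Python illustration of the pattern only, not the os-autoinst implementation of send_key_until_needle_match; the names press_until_match, screen_matches and press_key are hypothetical, and the screen check is passed in as a callback so the sketch runs without a test backend.

```python
from typing import Callable


def press_until_match(screen_matches: Callable[[], bool],
                      press_key: Callable[[], None],
                      max_presses: int = 20) -> bool:
    """Press a key repeatedly until the screen check succeeds.

    Returns True as soon as screen_matches() is true, or False if the
    check still fails after max_presses key presses.
    """
    for _ in range(max_presses):
        if screen_matches():
            return True
        press_key()
    # Final check after the last key press.
    return screen_matches()
```

Because the loop only depends on what is visible after each key press, it does not need to know whether the tab hotkey is 'alt-t', 'alt-r' or something else on a given product.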

#39 Updated by okurz over 5 years ago

  • Assignee changed from JERiveraMoya to okurz

SLindoMansilla, JERiveraMoya: There is one hypothesis you have not mentioned. It seems that there actually is a differing button. I wonder whether you have seen this; in the needles I created it is visible. Compare https://github.com/os-autoinst/os-autoinst-needles-opensuse/pull/282/files#diff-0a98fc599774cf73e5f165d6483189a5 and https://github.com/os-autoinst/os-autoinst-needles-opensuse/pull/282/files#diff-295cbb639bb910fede484a4855037ee2 . The hotkey for "bootloader options" changes because the second screen has an additional "release notes" button in the top right that takes "alt-l".

https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3762 and the corresponding needle PRs cover this now.

#40 Updated by JERiveraMoya over 5 years ago

Investigating why only 0 is typed instead of -1 (this issue makes grub_test fail). Not reproducible on my local machine. The logs show that -1 is typed. It seems to be related to the typing speed.
The textbox only detects "-", and after "ret" the value goes back to "0".
https://openqa.suse.de/tests/1230788#step/disable_grub_timeout/5
https://openqa.suse.de/tests/1231104#step/disable_grub_timeout/8
It might help to use type_string_slow. PR: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3777

#41 Updated by okurz over 5 years ago

PR merged and active on both osd and o3. We now sometimes see the problem with the '0', so it seems we are not done yet. I will also try to run it more often locally.

type_string_slow would only wait between the characters, and it's just two characters. Happened again in https://openqa.opensuse.org/tests/511736#step/disable_grub_timeout/8

I will try to reproduce locally as well. https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3778 serves as a temporary workaround to disable it again.

#42 Updated by SLindoMansilla over 5 years ago

  • Related to action #26936: disable_grub_timeout test broken added

#43 Updated by okurz over 5 years ago

I could not reproduce the "0 typing" problem locally in 40 runs, so my best guess remains that the production environment differs here and that something like a wait_still_screen(1) could help in production. This is included in my PR. I leave it to the reviewers to decide about that.

#45 Updated by okurz over 5 years ago

  • Status changed from In Progress to Resolved
  • Assignee changed from okurz to JERiveraMoya

I consider us done here. The last change I made was to add an additional wait_still_screen(1) to fix a sporadic problem we could never see locally. https://openqa.suse.de/tests/1231772 is the verification on SLE15,
https://openqa.suse.de/tests/1232728#step/disable_grub_timeout/14 is one on SLE 12 SP2 maintenance tests. https://openqa.opensuse.org/tests/512274 is an openSUSE staging test. I am currently monitoring maintenance tests for any missing parts, e.g. missing needles or jobs that still need to be retriggered. I consider that out of scope for this ticket, so I am setting it to resolved. I am also setting the assignee back to JERiveraMoya, who did the main work here.

#46 Updated by okurz over 5 years ago

  • Related to action #26978: autoyast/gnome installation stuck on grub added
