action #117808

open

[qe-sap][y] Grub error: Timeout reading in bootloader on PowerVM

Added by rainerkoenig over 1 year ago. Updated about 1 month ago.

Status: New
Priority: Normal
Assignee: -
Category: Bugs in existing tests
Target version: -
Start date: 2022-10-10
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

Example of failure in Migration job group (but the problem is not only seen in Migration):
openQA test in scenario sle-15-SP5-Migration-from-SLE15-SPx-ppc64le-online_sles15sp2_ltss_pscc_basesys-srv-desk_def_full_zypp_pre@ppc64le-hmc-single-disk fails in bootloader

It looks like openQA is sending the keystrokes that type the boot parameters too slowly and gets interrupted by this grub timeout error.

All PowerVM scenarios are failing for Product validation in the YaST group:
https://openqa.suse.de/tests/overview?arch=&flavor=&machine=ppc64le-hmc-single-disk%2Cppc64le-hmc-single-disk&test=&modules=&module_re=&version=15-SP5&groupid=129&build=40.1&distri=sle#


Related issues 1 (1 open, 0 closed)

Related to openQA Tests - action #117700: [qe-sap][y] Grub error: Stuck in "Welcome to Grub!" on PowerVM (New, 2022-10-07)

Actions #1

Updated by maritawerner over 1 year ago

  • Project changed from openQA Tests to openQA Infrastructure
  • Category deleted (Infrastructure)

I'm sure we already have some tickets on that topic.

Actions #2

Updated by okurz over 1 year ago

  • Related to action #117700: [qe-sap][y] Grub error: Stuck in "Welcome to Grub!" on PowerVM added
Actions #3

Updated by okurz over 1 year ago

  • Project changed from openQA Infrastructure to openQA Tests
  • Subject changed from Grub error: Timeout reading in bootloader on PowerVM to [qe-sap][y] Grub error: Timeout reading in bootloader on PowerVM
  • Description updated (diff)
  • Category set to Bugs in existing tests

It looks like openQA is sending the keystrokes that type the boot parameters too slowly and gets interrupted by this grub timeout error.

I don't think this is the case. https://openqa.suse.de/tests/9680445#step/bootloader/22 shows the error, but it might be clearer in the video starting at https://openqa.suse.de/tests/9680445/video?filename=video.ogv&t=17.54,17.58 . The command line is completely typed and, I assume, sent with a newline. It then takes about 30s (04:09:30 to 04:10:00) until the message "error: timeout reading" shows up. A quick web search has not revealed much. https://lists.gnu.org/archive/html/help-grub/2015-09/msg00028.html mentions a 7-year-old grub patch that should fix such a problem, and https://www.linuxquestions.org/questions/slackware-14/problems-booting-installer-to-uefi-system-via-pxe-4175696093/ describes timeouts when reading over the network. Here we have a PowerPC system using the HMC interface, so I assume the network is also involved. "grenache-1:24" corresponds to https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L1284 on "redcurrant".

To me this looks like a problem within the PowerPC infrastructure, and you would need to get in contact with "PowerPC experts within SUSE" to find out whether this is something that can be improved or should be expected. As an alternative, or as a consequence if the answer is that this issue is to be expected, one could build repetition on failure into the bootloader code that loads the kernel image. The maintainer of the "bootloader" test module is Jozef Pupava, so assigning to [qe-sap] accordingly, but also adding [y] for QE YaST as the job group owners.
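
The "repetition on failure" idea could look roughly like the following. This is a minimal, language-neutral sketch in Python, not the actual openQA "bootloader" module (whose code is not shown in this ticket); `BootTimeout`, `retry_on_timeout` and the `load_kernel` callable are hypothetical names introduced only for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class BootTimeout(Exception):
    """Hypothetical error for the case that grub prints 'error: timeout reading'."""


def retry_on_timeout(load_kernel: Callable[[], T],
                     attempts: int = 3,
                     delay_s: float = 5.0) -> T:
    """Call load_kernel() until it succeeds or the attempts are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[BootTimeout] = None
    for attempt in range(1, attempts + 1):
        try:
            return load_kernel()
        except BootTimeout as err:
            last_error = err
            # Wait before retrying, but not after the final attempt.
            if attempt < attempts:
                time.sleep(delay_s)
    # All attempts timed out: surface the last timeout to the caller.
    assert last_error is not None
    raise last_error
```

Only the timeout case is retried here; any other error would still fail the step immediately, so unrelated problems are not hidden by the retry.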

Actions #4

Updated by openqa_review over 1 year ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: offline_sles15sp2_ltss_pscc_basesys-srv-lp_all_full_common-criteria_pre
https://openqa.suse.de/tests/9776089#step/bootloader/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #5

Updated by michals over 1 year ago

Why do you think network is involved?

Where is the boot file located?

This is relative to some grub root but it is not clear what the root is.

Actions #6

Updated by rainerkoenig over 1 year ago

I would not bet that it is a network problem. We also see this issue on the FULL medium:

Actions #7

Updated by michals over 1 year ago

The only "timeout reading" message in grub is in the network code, though:

grub-core/net/net.c: grub_error (GRUB_ERR_TIMEOUT, N_("timeout reading `%s'"), net->name);

Actions #8

Updated by michals over 1 year ago

Can you provide a packet dump corresponding to the transfer that times out?

Actions #10

Updated by rainerkoenig over 1 year ago

michals wrote:

Can you provide a packet dump corresponding to the transfer that times out?

Sorry, I have no idea how I can obtain the traffic when this problem occurs.
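
One possible way to obtain such a dump, sketched under assumptions: run tcpdump on a host that sees the traffic (for example the server the LPAR loads its files from) while the job boots. The interface name, the LPAR address and the output path below are placeholders, and the filter assumes the transfer runs over UDP, as TFTP does. The helper only builds the command line; it does not start a capture.

```python
import shlex
from typing import List


def build_capture_command(interface: str, peer_host: str, pcap_path: str) -> List[str]:
    """Build a tcpdump command that writes UDP traffic to/from peer_host to a file."""
    return [
        "tcpdump",
        "-i", interface,   # capture on this network interface
        "-n",              # do not resolve addresses to names in the output
        "-w", pcap_path,   # write raw packets to this file
        "host", peer_host, "and", "udp",  # capture filter expression
    ]


if __name__ == "__main__":
    # All three values are placeholders, not taken from the openQA setup.
    cmd = build_capture_command("eth0", "192.0.2.10", "/tmp/grub-timeout.pcap")
    print(shlex.join(cmd))
```

The resulting file can be attached to the ticket and opened in a packet analyzer to see where the transfer stalls.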

Actions #11

Updated by michals over 1 year ago

Then there is not much that can be done.

Probably some packets are lost somewhere sometimes, and sometimes so many are lost that the boot fails.

If there is some option to increase this timeout in grub you may want to try that.

Actions #12

Updated by JERiveraMoya over 1 year ago

I don't think that is an option for grub; the timeouts grub offers are not related to this specific case of entering a command which under the hood seems to do some TFTP downloading. Search results all over the place point to a patch to be applied to grub:
https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1900773
Do we know whether these machines are properly updated/patched?

Actions #13

Updated by michals over 1 year ago

This fix is part of grub 2.06 which is released at least in 15.4

The breakage is caused by 781b3e5efc3, which is also only released in 2.06, so this would only be observed with a grub specifically patched with 781b3e5efc3 but missing the fix.

Also, this only applies to files bigger than 64MB or so, and then it fails consistently.
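
A limit of "64MB or so" would be consistent with the TFTP block counter wrapping: the block number is a 16-bit field, so with 1024-byte blocks the counter wraps after roughly 64 MiB. The 1024-byte block size is an assumption here (it is not stated in this ticket), and the function names are illustrative only.

```python
TFTP_MAX_BLOCK_NUMBER = 0xFFFF  # the TFTP block number is a 16-bit field


def data_packets_needed(file_size: int, block_size: int = 1024) -> int:
    """Number of data packets: all full blocks plus one final short (possibly empty) block."""
    return file_size // block_size + 1


def needs_block_rollover(file_size: int, block_size: int = 1024) -> bool:
    """True if the transfer needs more data packets than the counter can number."""
    return data_packets_needed(file_size, block_size) > TFTP_MAX_BLOCK_NUMBER


if __name__ == "__main__":
    mib = 1024 * 1024
    for size_mib in (46, 64, 100):
        print(size_mib, "MiB needs rollover:", needs_block_rollover(size_mib * mib))
```

Under these assumptions, a file well below 64 MiB never reaches the wrap-around, so a rollover bug alone would not explain a timeout on such a file.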

Actions #14

Updated by rainerkoenig over 1 year ago

michals wrote:

This fix is part of grub 2.06 which is released at least in 15.4

This raises an infrastructure question: if SLE >= 15.4 ships grub version 2.06,
then why does my screenshot of the problem show "GNU GRUB version 2.02~beta" at the top?

Are we having an infrastructure issue with an old grub version used in the openQA setup?

Actions #15

Updated by JERiveraMoya over 1 year ago

  • Description updated (diff)

Checking the logs of any other job, we have package grub2 [2.06-150500.15.1.x86_64] for SLE 15 SP5, but in this case the grub that we see is the one for Power HMC (if I'm not wrong).
Therefore, either this machine gets updated or we need some workaround?
By the way, the size of the file seems to be 46MB: https://openqa.suse.de/assets/repo/SLE-15-SP5-Online-ppc64le-Build40.1-Media1/boot/ppc64le/linux

Actions #16

Updated by michals over 1 year ago

I don't think HMC is involved in any way.

Generating the grub image that is served by the TFTP server is a manual process which has apparently not been automated so far.

Then the grub image served by the TFTP server is obsolete until manually updated.

Actions #17

Updated by JERiveraMoya over 1 year ago

Sorry, you are right: we control the LPAR from the HMC when we start the automation, but the grub we see is in the LPAR, served via TFTP as you mentioned.
Whom could we ask for help with this manual process? :)

Actions #18

Updated by JERiveraMoya over 1 year ago

Also, it is a more generic issue related to networking; we can see it even in a VM:
https://openqa.suse.de/tests/10013125#

Actions #19

Updated by openqa_review over 1 year ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_migration_offline_dvd_sle15sp2_ltss_pre
https://openqa.suse.de/tests/10097391#step/bootloader_start/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #21

Updated by openqa_review about 1 year ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: autoupgrade_sles15sp2_ltss_scc_basesys-srv-lp-tsm_def_full_ssh_auto
https://openqa.suse.de/tests/10227107#step/bootloader/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #22

Updated by JERiveraMoya about 1 year ago

We have 2 volunteers in the squad to learn PowerVM, but other than that, we don't know how to tackle this particular grub issue.
We need expertise to update that grub; it could also help to move the workers to a faster network to avoid hitting the problem.

We are moving to spvm for now to work around it. According to a recent investigation it is more stable:
https://openqa.suse.de/tests/overview?distri=sle&version=15-SP5&build=hjluo%2Fos-autoinst-distri-opensuse%23spvm_clone

Actions #23

Updated by openqa_review about 1 year ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: autoupgrade_sles15sp2_ltss_scc_basesys-srv-lp-tsm_def_full_ssh_auto
https://openqa.suse.de/tests/10341935#step/bootloader/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #24

Updated by openqa_review 11 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: RAID10@ppc64le-hmc-4disk
https://openqa.suse.de/tests/11023168#step/bootloader_start/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 56 days if nothing changes in this ticket.

Actions #25

Updated by szarate 10 months ago

  • Tags set to qe-core-june-sprint
Actions #26

Updated by michals about 1 month ago

JERiveraMoya wrote in #note-18:

Also, it is a more generic issue related to networking; we can see it even in a VM:
https://openqa.suse.de/tests/10013125#

Then it's much easier to reproduce.

I suppose that if you get the network problem with the OS running, you can take a network capture on the OS side, find out how often the problem happens when a transfer is tried multiple times in a loop, etc.
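
A rough sketch of such a loop, assuming the transfer can be wrapped in a callable that raises `OSError` (which includes timeouts) on failure. `measure_failure_rate` and the `transfer` callable are hypothetical names; how the file is actually fetched (TFTP client, HTTP download, etc.) is left open here.

```python
import time
from typing import Callable


def measure_failure_rate(transfer: Callable[[], None],
                         attempts: int = 20,
                         pause_s: float = 1.0) -> float:
    """Run transfer() repeatedly and return the fraction of attempts that failed."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    failures = 0
    for i in range(attempts):
        try:
            transfer()
        except OSError as err:  # TimeoutError and socket errors are OSError subclasses
            failures += 1
            print(f"attempt {i + 1}/{attempts} failed: {err}")
        # Pause between attempts, but not after the last one.
        if i < attempts - 1:
            time.sleep(pause_s)
    return failures / attempts
```

Running this while a packet capture is active would tie each failed attempt to the corresponding traffic.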
