action #120570

closed

[qe-core][functional][tools] test fails in bootloader because root device is not ready and it leads to kernel panic size:M

Added by zluo over 1 year ago. Updated 7 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Support
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Observation

openQA test in scenario sle-15-SP5-Online-ppc64le-textmode+role_textmode@ppc64le-hmc fails in
bootloader

Test suite description

Maintainers: QE Core, mgriessmeier

Like default but explicitly select the system role "textmode".

Reproducible

Fails since (at least) Build 40.1 (current job)
This seems to be a sporadic issue; it needs further investigation.

Expected result

Last good: 38.1 (or more recent)

Further details

Always latest result in this scenario: latest


Related issues 1 (0 open, 1 closed)

Related to openQA Tests - action #122143: [qe-core][functional] test fails in bootloader because grub rescue mode entered due to network issue (Resolved, zluo, 2022-12-19)

Actions #1

Updated by openqa_review about 1 year ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: textmode+role_textmode@ppc64le-hmc
https://openqa.suse.de/tests/10028741#step/bootloader/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #2

Updated by zluo about 1 year ago

  • Status changed from New to In Progress
  • Assignee set to zluo

Taking over to check.

Actions #3

Updated by zluo about 1 year ago

https://openqa.suse.de/tests/10028741#step/bootloader/24

It looks like the initrd cannot be loaded. Could this be a network issue with the NFS mount of the mnt directory?

Actions #4

Updated by zluo about 1 year ago

https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/16029/ has been updated; it still does not cover all possible sporadic issues ;)

Actions #5

Updated by zluo about 1 year ago

https://openqa.suse.de/tests/10106734#step/bootloader/45

this could be an acceptable case for re-trying loading.

Actions #6

Updated by zluo about 1 year ago

I think we have to live with it for now:

https://openqa.suse.de/tests/10152427
Retrying after reset_lpar_netboot still does not work and hits the timeout.
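
The retry-after-reset idea discussed in the comments above can be illustrated with a bounded loop. This is only a minimal Python sketch, not the code from the merged PR (the real test code lives in os-autoinst-distri-opensuse and is not shown in this ticket). The names `boot_with_retries`, `try_boot` and `reset` are hypothetical; `reset` merely stands in for whatever reset_lpar_netboot does.

```python
import time
from typing import Callable


def boot_with_retries(
    try_boot: Callable[[], bool],
    reset: Callable[[], None],
    max_attempts: int = 3,
    delay_s: float = 0.0,
) -> int:
    """Return the 1-based number of the attempt that succeeded.

    try_boot and reset are hypothetical callbacks supplied by the caller;
    reset is only invoked between attempts, never after the last one.
    """
    for attempt in range(1, max_attempts + 1):
        if try_boot():
            return attempt
        if attempt < max_attempts:
            reset()
            time.sleep(delay_s)
    # Give up explicitly instead of waiting for an outer timeout.
    raise TimeoutError(f"boot failed after {max_attempts} attempts")
```

Bounding the attempts makes the failure explicit, but as the comment above notes, no amount of retrying helps if the network problem persists for the whole run.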

Actions #7

Updated by zluo about 1 year ago

https://openqa.suse.de/tests/10164024#next_previous latest test runs after PR got updated for review.

Actions #8

Updated by zluo about 1 year ago

  • Status changed from In Progress to Feedback

PR merged.

Actions #9

Updated by zluo about 1 year ago

  • Related to action #122143: [qe-core][functional] test fails in bootloader because grub rescue mode entered due to network issue added
Actions #10

Updated by zluo about 1 year ago

  • Status changed from Feedback to Resolved

Set it as resolved now.

Actions #11

Updated by okurz about 1 year ago

  • Status changed from Resolved to Feedback

Hi, this can't be resolved as long as there are soft-failure references to this ticket (https://openqa.suse.de/tests/10375315#step/bootloader/25), so please make sure the corresponding test code does not reference this or any other ticket in a soft-fail.

Actions #12

Updated by openqa_review 12 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: textmode+role_textmode@ppc64le-hmc
https://openqa.suse.de/tests/10562883#step/bootloader/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 68 days if nothing changes in this ticket.

Actions #13

Updated by zluo 11 months ago

I think the real root cause is the directory issue on qanet. If the network has a problem (for example, mounting the directory does not work), then we have trouble loading the initrd and the Linux kernel.
So I can remove the workaround for a test.

Actions #14

Updated by zluo 11 months ago

https://progress.opensuse.org/issues/120570 this does not look good and seems to be a new issue with the network.

Actions #15

Updated by zluo 11 months ago

zluo wrote:

https://progress.opensuse.org/issues/120570 this does not look good and seems to be a new issue with the network.

https://openqa.suse.de/tests/10807336 shows that grub menu data can be transferred and displayed. The network issue cannot be resolved by any workaround.

Actions #16

Updated by zluo 11 months ago

https://openqa.suse.de/tests/10811737#next_previous shows some failures. This is certainly a network issue at the moment.

Actions #17

Updated by openqa_review 10 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: textmode+role_textmode@ppc64le-hmc
https://openqa.suse.de/tests/10924667#step/bootloader/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #18

Updated by zluo 10 months ago

re-triggered and it looks good:

https://openqa.suse.de/tests/10938921

Actions #19

Updated by openqa_review 10 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: textmode+role_textmode@ppc64le-hmc
https://openqa.suse.de/tests/10940988#step/bootloader/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #20

Updated by mgrifalconi 10 months ago

  • Status changed from Feedback to Workable

Hello, I don't see a reason to keep this ticket in Feedback.

The record_softfailure will reopen the ticket automatically, as Oliver said.
To resolve the ticket we should remove the soft-failure mark in the code and either just retry on failures or invest more in the problem and actually solve it.

There is no chance to fix this from our side; this is clearly an issue on qanet and a network issue.

Actions #21

Updated by zluo 9 months ago

  • Assignee changed from zluo to okurz

@okurz

The root cause is on qanet, and the network issue happens sporadically. My previous workaround (retry, reset) cannot fix it.
So please help to fix the issue. I can of course remove the softfail, but then we go back to the problem we had before.

Actions #22

Updated by zluo 9 months ago

  • Category changed from Bugs in existing tests to Infrastructure
Actions #23

Updated by okurz 9 months ago

  • Tags changed from bugbusters to bugbusters, infra
  • Subject changed from [qe-core][functional] test fails in bootloader because root device is not ready and it leads to kernel panic to [qe-core][functional][tools] test fails in bootloader because root device is not ready and it leads to kernel panic
  • Status changed from Workable to New
  • Assignee deleted (okurz)
  • Priority changed from Normal to High
  • Target version changed from QE-Core: Ready to Ready
Actions #24

Updated by okurz 9 months ago

  • Project changed from openQA Tests to openQA Project
  • Due date set to 2023-06-23
  • Category changed from Infrastructure to Support
  • Status changed from New to Feedback
  • Assignee set to okurz

zluo wrote:

https://progress.opensuse.org/issues/120570 this does not look good and seems to be a new issue with the network.

  1. You are just referencing this ticket itself. Did you want to include another reference?

zluo wrote:

@okurz

The root cause is on qanet, and the network issue happens sporadically. My previous workaround (retry, reset) cannot fix it.
So please help to fix the issue. I can of course remove the softfail, but then we go back to the problem we had before.

  2. Could you please share a bit more context on what you think the issue is?

Following the openQA test URL from #120570-19 in "Next & Previous", I find the latest job in this scenario failing with the same error symptoms:
https://openqa.suse.de/tests/11162717
In https://openqa.suse.de/tests/11162717#step/bootloader/25 I can see the job loading initrd from the file path "mnt/openqa/repo/SLE-15-SP5-Online-ppc64le-Build102.1-Media1/boot/ppc64le/initrd". That's a path on qanet relative to /srv/tftp. The file /srv/tftp/mnt/openqa/repo/SLE-15-SP5-Online-ppc64le-Build102.1-Media1/boot/ppc64le/initrd exists and it is there right now. It is "XZ compressed data", so I am pretty sure it is intact and could be read in grub; otherwise grub would have reported a read timeout or something similar. Also, https://openqa.suse.de/tests/11176137#step/bootloader/25 on the same machine grenache-1:22 "redcurrant-2" passed and had no problems reading the same file.

  3. How to reproduce?

  4. To keep the overview I suggest you update the ticket description according to the template https://progress.opensuse.org/projects/openqav3/wiki/#Further-decision-steps-working-on-test-issues and follow https://progress.opensuse.org/projects/openqatests/wiki/Wiki#Statistical-investigation to better understand the statistics

  5. For further investigation I suggest to only schedule tests with the test module "bootloader" and with video enabled
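
The reasoning above about the initrd being "XZ compressed data" only shows that the file starts with the XZ signature. To go one step further, the whole stream can be decompressed once to see whether it is truncated or corrupt. The following is a minimal Python sketch under that assumption; the function names are hypothetical and the check would have to be run on a host that can read the file. It says nothing about whether grub can fetch the file over the network.

```python
import lzma

# First six bytes of every .xz file, as defined by the XZ file format.
XZ_MAGIC = b"\xfd7zXZ\x00"


def has_xz_magic(path: str) -> bool:
    """Cheap check: does the file start with the XZ signature?"""
    with open(path, "rb") as fh:
        return fh.read(len(XZ_MAGIC)) == XZ_MAGIC


def xz_stream_is_intact(path: str, chunk_size: int = 1 << 20) -> bool:
    """Expensive check: decompress the whole file and discard the output.

    A truncated file raises EOFError, corrupt data raises lzma.LZMAError.
    """
    try:
        with lzma.open(path, "rb") as fh:
            while fh.read(chunk_size):
                pass
    except (lzma.LZMAError, EOFError):
        return False
    return True
```

A file can pass `has_xz_magic` and still fail `xz_stream_is_intact`, for example when a copy was interrupted halfway.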

Actions #25

Updated by livdywan 9 months ago

  • Subject changed from [qe-core][functional][tools] test fails in bootloader because root device is not ready and it leads to kernel panic to [qe-core][functional][tools] test fails in bootloader because root device is not ready and it leads to kernel panic size:M
Actions #26

Updated by okurz 9 months ago

  • Priority changed from High to Normal

reducing prio as there is apparently less interest from reporter.

Actions #27

Updated by okurz 9 months ago

  • Status changed from Feedback to Resolved

I assume the problem resolved itself because unfortunately there is no further response. I checked if there are any recent job labels using this ticket but openqa-query-for-job-label 120570 shows that we are good:

11162717|2023-05-19 15:17:50|done|failed|textmode+role_textmode||grenache-1
11146733|2023-05-17 03:26:25|done|failed|textmode+role_textmode||grenache-1
11140430|2023-05-16 15:01:14|done|failed|textmode+role_textmode||grenache-1
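
To judge whether the label is still used by recent jobs, the pipe-separated output above can be summarized with a few lines of Python. This is a sketch only: the field meanings (job id, timestamp, state, result, test, an empty field, host) are guessed from the three rows shown here, and the names `LabelledJob`, `parse_rows` and `newest_timestamp` are hypothetical.

```python
from collections import Counter
from typing import List, NamedTuple


class LabelledJob(NamedTuple):
    job_id: int
    timestamp: str  # assumed meaning of the second field
    state: str
    result: str
    test: str
    host: str  # assumed meaning of the last field


def parse_rows(text: str) -> List[LabelledJob]:
    """Parse pipe-separated rows; skip lines that do not look like job rows."""
    rows = []
    for line in text.splitlines():
        parts = line.strip().split("|")
        if len(parts) != 7 or not parts[0].isdigit():
            continue
        job_id, timestamp, state, result, test, _unknown, host = parts
        rows.append(LabelledJob(int(job_id), timestamp, state, result, test, host))
    return rows


def newest_timestamp(rows: List[LabelledJob]) -> str:
    """Timestamps in this format sort correctly as plain strings."""
    return max(row.timestamp for row in rows)


SAMPLE = """\
11162717|2023-05-19 15:17:50|done|failed|textmode+role_textmode||grenache-1
11146733|2023-05-17 03:26:25|done|failed|textmode+role_textmode||grenache-1
11140430|2023-05-16 15:01:14|done|failed|textmode+role_textmode||grenache-1
"""

rows = parse_rows(SAMPLE)
print(Counter(row.result for row in rows))
print(newest_timestamp(rows))
```

Comparing the newest timestamp with the current date gives a quick indication of how long ago the label was last applied.
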
Actions #28

Updated by openqa_review 7 months ago

  • Status changed from Resolved to Feedback

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: textmode+role_textmode@ppc64le-hmc
https://openqa.suse.de/tests/11162717#step/bootloader/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #29

Updated by okurz 7 months ago

  • Due date deleted (2023-06-23)
  • Status changed from Feedback to Resolved

reminded rfan1 about the SLE15-SP6 setup in #131531 which is relevant here. That might be enough.
