action #88273

closed

[qem][qe-sap][ha] grub2 error: invalid magic number

Added by dzedro about 3 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Bugs in existing tests
Target version:
-
Start date:
2021-01-27
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

openQA test in scenario sle-12-SP3-Server-DVD-HA-Updates-x86_64-qam_ha_rolling_update_node01@64bit fails in
console_reboot

Reproducible

Fails since (at least) Build 20210126-1 (current job)

Expected result

Last good: 20210125-2 (or more recent)

Further details

Always latest result in this scenario: latest

Actions #1

Updated by okurz about 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_rolling_update_node02
https://openqa.suse.de/tests/5483858

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #2

Updated by openqa_review about 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_rolling_update_node02
https://openqa.suse.de/tests/5649219

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #3

Updated by okurz about 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_rolling_update_node01
https://openqa.suse.de/tests/5741396

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #4

Updated by okurz about 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_rolling_upgrade_migration_node02
https://openqa.suse.de/tests/5818158

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #5

Updated by okurz almost 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_rolling_upgrade_migration_node02
https://openqa.suse.de/tests/5909435

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #6

Updated by osukup almost 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_rolling_update_node01
https://openqa.suse.de/tests/6002065

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #7

Updated by okurz almost 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_rolling_update_node02
https://openqa.suse.de/tests/6141214

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #8

Updated by okurz almost 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_rolling_update_node01
https://openqa.suse.de/tests/6230455

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #9

Updated by acarvajal almost 3 years ago

This test runs in an MM setup, with one job for node 1 of an HA cluster and another job for node 2.

Both nodes boot from the same qcow2 image (HDD_1=sle-12-SP3-x86_64-ha-rolling_upgrade_update.qcow2), register the system with SCC, run a zypper patch command and then reboot before configuring the HA cluster later on.

However, as can be seen in the results from build 20210618-1, node 2 is unable to boot after the zypper patch, while node 1 is:

node 1: https://openqa.suse.de/tests/6285308#step/console_reboot/4
node 2: https://openqa.suse.de/tests/6285307#step/console_reboot/4

Comparing the settings, I did not find anything that would explain why two tests which, up to that step, should have behaved identically end up with different results.

I do not even think any QAM-specific setup is involved by the time the tests fail, as at that point this is simply "boot into 12-SP3 -> register -> patch & reboot". Even comparing both autoinst logs does not show anything that would explain the difference.

Looking around, I see that sometimes it is node 1 that is unable to boot and sometimes node 2, as in the examples above.

Since both nodes start from the same conditions (same qcow2 image), run the same modules, register the same extensions, etc., I can only guess that there is a race condition causing one of the nodes to fail to boot after patching while the other works.
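
For reference, a minimal sketch of what each node effectively runs up to the failing step, assuming SCC registration is done with SUSEConnect (the regcode is a placeholder); this is an illustration, not the actual test code:

# register the freshly booted 12-SP3 system against SCC (placeholder regcode)
SUSEConnect --regcode <REGCODE>
# apply all available maintenance updates; rerun once if zypper itself was updated (exit code 103)
zypper -n patch || zypper -n patch
# reboot into the possibly new kernel -- this is where grub2 reports "invalid magic number"
reboot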

Actions #10

Updated by acarvajal almost 3 years ago

As can be seen at https://openqa.suse.de/tests/6285307#next_previous, the test finished successfully in the run with build 20210620-1, while it failed with 20210621-1.

Test settings are of course the same save for the build.

Passing tests ran on openqaworker10:5, openqaworker5:6 & openqaworker10:8

Failing tests ran on openqaworker10:4, openqaworker10:7 & openqaworker5:34

So it does not seem to be due to a difference between the workers used.

I will clone the failing test into our development environment and run it several times to try to reproduce the issue, but I suspect a performance-related race condition is affecting these results.

Actions #11

Updated by acarvajal almost 3 years ago

Cloned these jobs into our development environment with:

openqa-clone-job --host localhost --clone-children https://openqa.suse.de/tests/6298499 WORKER_CLASS=tap,qemu_x86_64

All tests passed in 5 consecutive runs:

1) http://mango.qa.suse.de/tests/4107 & http://mango.qa.suse.de/tests/4108
2) http://mango.qa.suse.de/tests/4110 & http://mango.qa.suse.de/tests/4111
3) http://mango.qa.suse.de/tests/4113 & http://mango.qa.suse.de/tests/4114
4) http://mango.qa.suse.de/tests/4116 & http://mango.qa.suse.de/tests/4117
5) http://mango.qa.suse.de/tests/4119 & http://mango.qa.suse.de/tests/4120
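
Roughly how such consecutive runs can be kicked off, simply repeating the exact clone command used above (host, job URL and worker class unchanged); only a convenience sketch:

# clone the failing pair several times in a row to look for the sporadic failure
for i in $(seq 1 5); do
    openqa-clone-job --host localhost --clone-children \
        https://openqa.suse.de/tests/6298499 WORKER_CLASS=tap,qemu_x86_64
done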

The test code is current for the modules in use when compared with osd:

root@mango:/var/lib/openqa/tests/sle [master|…1] > git log --oneline | head -5
203e34915 Merge pull request #12751 from jlausuch/containers_add_ubi_images
289af6565 Use --force-share to perform operations on image
1006cb0e2 Containers: add Ubi6 and Ubi7 images to the 3rd party image tests
0f6e5c483 lib/utils.pm: Support two stages of zypper patch returning 103
22251967e Merge pull request #12747 from pdostal/tw_ssh

vs.

openqa:/var/lib/openqa/tests/sle # git log --oneline | head -5
d9c826ef4 Merge pull request #12762 from grisu48/run_ltp
e5d7c3c6c Merge pull request #12753 from jlausuch/basic_k3s_test
203e34915 Merge pull request #12751 from jlausuch/containers_add_ubi_images
289af6565 Use --force-share to perform operations on image
d6f22108a Remove required SCC_REGCODE variable for OnDemand

The webUI is at openQA-4.6.1624280432.2ce59c621-lp151.4088.1.noarch in the development environment vs. openQA-4.6.1624015887.f215823e1-lp152.4085.1.noarch in osd, and openQA-worker is at openQA-worker-4.6.1624280432.2ce59c621-lp152.4087.1.noarch in the development environment vs. openQA-worker-4.6.1624015887.f215823e1-lp152.4085.1.noarch on openqaworker10.

However, the test fails frequently in osd: https://openqa.suse.de/tests/6285307#next_previous

Could we attempt to move these tests out of openqaworker10 into another worker to see if they fare better?

Any hint at what may be causing this? It is frequent enough to look like a real issue, but it is clearly not caused by the test code.

Actions #12

Updated by okurz almost 3 years ago

For reference what I answered in chat:

Actions #13

Updated by acarvajal almost 3 years ago

okurz wrote:

For reference what I answered in chat:

Thanks.

  • What is your current hypothesis? And what experiment did you design to check it?

No current hypothesis. I'm currently blocked with this. The only working detail I have is that the test works consistently without issues in our development environment and fails sporadically on openqaworker10 before even attempting any HA configuration (i.e., a simple boot + zypper up + reboot triggers the issue).

As discussed in the chat, I will move this to another worker and monitor, and will write & schedule a test module that checks the integrity of the qcow2 image when the tests start, even though I'm certain the problem is not caused by differences in the qcow2 image, as it is the same file path in both jobs and both jobs run on the same worker.

Whatever is breaking the test seems to come in via an update installed by zypper up, which should also be the same for both jobs. If I had to guess, I'd say the updated kernel file gets corrupted when it is written in one of the VMs and not in the other.

  • All worker machines within o3 and osd have their hostname as an additional worker class. This is meant to be solely used for investigation. So you can schedule tests on a specific machine, e.g. with WORKER_CLASS=openqaworker9

Yes, I will try this; a rough example of such a pinned clone follows below.
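
A rough example of combining the hostname-based worker class with the clone command from comment #11; whether the tap class has to stay in the list depends on the MM setup, so treat this only as a sketch:

# pin the cloned jobs to a single machine for investigation
# (for MM jobs the tap class may need to be kept, e.g. WORKER_CLASS=tap,openqaworker9)
openqa-clone-job --host localhost --clone-children \
    https://openqa.suse.de/tests/6298499 WORKER_CLASS=openqaworker9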

Actions #14

Updated by acarvajal almost 3 years ago

I moved the rolling update tests to openqaworker9 via the qam_ha_rolling_update_node01, qam_ha_rolling_update_node02 & qam_ha_rolling_update_support_server test suites.

Let's see if there's a different outcome in this worker.

Actions #15

Updated by okurz almost 3 years ago

acarvajal wrote:

okurz wrote:

For reference what I answered in chat:

Thanks.

  • What is your current hypothesis? And what experiment did you design to check it?

No current hypothesis. I'm currently blocked with this. The only working detail I have is that the test works consistently without issues in our development environment and fails sporadically on openqaworker10 before even attempting any HA configuration (i.e., a simple boot + zypper up + reboot triggers the issue).

As discussed in the chat, I will move this to another worker and monitor, and will write & schedule a test module that checks the integrity of the qcow2 image when the tests start, even though I'm certain the problem is not caused by differences in the qcow2 image, as it is the same file path in both jobs and both jobs run on the same worker.

So that means you already have two hypotheses and experiments in mind, even if you did not call them that:

  • H1 The worker machine makes a difference -> E1-1 pin jobs to a special worker and check if the fail ratio significantly differs
  • H2 The root storage content is corrupted when loading within openQA tests -> E2-1 schedule a test module to check the integrity of the qcow2 image when the tests are starting
Actions #16

Updated by acarvajal almost 3 years ago

First run on openqaworker9 failed:

Node1: https://openqa.suse.de/tests/6323075#step/console_reboot/4 (failing)
Node2: https://openqa.suse.de/tests/6323076 (not failing)

While the last runs on openqaworker10 worked:

Node1: https://openqa.suse.de/tests/6312914
Node2: https://openqa.suse.de/tests/6312913

I will leave this as is for a couple of days to gather more results, but the initial ones are not encouraging.

Actions #17

Updated by acarvajal almost 3 years ago

okurz wrote:

  • H2 The root storage content is corrupted when loading within openQA tests -> E2-1 schedule a test module to check the integrity of the qcow2 image when the tests are starting

Based on my last message, I will have to assign a higher priority to H2. However, I'm certain the root storage content itself is not corrupted; I think whatever corruption there is comes in after zypper up.

But let's see what the results of E2-1 are before anything else.

Actions #18

Updated by acarvajal almost 3 years ago

I am submitting https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/12791 to record checksums/digests of the HDD_# assets and of the kernel files, to get a better hint of what may be happening in osd with these tests.

@okurz @grifalconi @dzedro could you help me with a review? Thanks in advance.
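
The kind of information being recorded roughly corresponds to commands like the following; the pool path and the split between worker and SUT are assumptions for illustration only, since the PR itself implements this as openQA test modules:

# on the worker: digest of the backing image and of the per-job overlay
sha256sum /var/lib/openqa/pool/<N>/sle-12-SP3-x86_64-ha-rolling_upgrade_update.qcow2
sha256sum /var/lib/openqa/pool/<N>/raid/hd0-overlay0

# inside the SUT, before and after 'zypper patch': digests of the boot files grub2 loads
sha256sum /boot/vmlinuz-* /boot/initrd-*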

Actions #19

Updated by acarvajal almost 3 years ago

We have some results from the investigation, and they are inconclusive.

In the passing tests in the development environment, we can see that the digest of the qcow2 image is the same on both nodes right after setup:

Node 1: http://mango.qa.suse.de/tests/4143#step/show_hdd_info/1
Node 2: http://mango.qa.suse.de/tests/4144#step/show_hdd_info/1

hd0_overlay0's digest is different, but this is expected.

Digests remain unchanged after zypper up:

Node 1: http://mango.qa.suse.de/tests/4143#step/show_hdd_info#1/1
Node 2: http://mango.qa.suse.de/tests/4144#step/show_hdd_info#1/1

And the kernel digest is the same on both nodes:

Node 1: http://mango.qa.suse.de/tests/4143#step/check_boot_files/4
Node 2: http://mango.qa.suse.de/tests/4144#step/check_boot_files/4

After this, both nodes successfully reboot and test passes.

In a failing test on openqaworker9, it can also be seen that the digests of the disks and of the kernel are the same between the two nodes:

Node 1: https://openqa.suse.de/tests/6345325#step/show_hdd_info/1
Node 2: https://openqa.suse.de/tests/6345326#step/show_hdd_info/1

Same after zypper up:

Node 1: https://openqa.suse.de/tests/6345325#step/show_hdd_info#1/1
Node 2: https://openqa.suse.de/tests/6345326#step/show_hdd_info#1/1

And kernel after zypper up:

Node 1: https://openqa.suse.de/tests/6345325#step/check_boot_files/4
Node 2: https://openqa.suse.de/tests/6345326#step/check_boot_files/4

However, node 1 failed to boot with the new kernel:

https://openqa.suse.de/tests/6345325#step/console_reboot/4

While node 2 successfully booted:

https://openqa.suse.de/tests/6345326#step/console_reboot/4

So at least it seems the cause is not any disk or file corruption.

Actions #20

Updated by dzedro almost 3 years ago

Why should sle-12-SP3-x86_64-ha-rolling_upgrade_update.qcow2 change ... ?

[2021-06-28T15:26:00.645 CEST] [debug] running /usr/bin/qemu-img create -f qcow2 -b /var/lib/openqa/pool/16/sle-12-SP3-x86_64-ha-rolling_upgrade_update.qcow2 /var/lib/openqa/pool/16/raid/hd0-overlay0 16106127360
[2021-06-28T15:26:00.659 CEST] [debug] Formatting '/var/lib/openqa/pool/16/raid/hd0-overlay0', fmt=qcow2 size=16106127360 backing_file=/var/lib/openqa/pool/16/sle-12-SP3-x86_64-ha-rolling_upgrade_update.qcow2 
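
For reference, the overlay chain can be inspected directly on the worker with qemu-img to confirm that the backing file is untouched; -U/--force-share is needed while a VM still has the image open (illustration only, paths as in the log above):

# show which backing file the overlay points at
qemu-img info -U /var/lib/openqa/pool/16/raid/hd0-overlay0
# check both images for internal corruption
qemu-img check -U /var/lib/openqa/pool/16/raid/hd0-overlay0
qemu-img check -U /var/lib/openqa/pool/16/sle-12-SP3-x86_64-ha-rolling_upgrade_update.qcow2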
Actions #21

Updated by openqa_review almost 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_rolling_update_node01
https://openqa.suse.de/tests/6435426

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The label in the openQA scenario is removed
Actions #22

Updated by acarvajal over 2 years ago

The issue is also present in the 12-SP4 rolling upgrade scenario:

https://openqa.suse.de/tests/6489800
https://openqa.suse.de/tests/6490880#step/console_reboot/4

Also booting from the same 12-SP3 qcow2 image.

So in the 12-SP3 job groups the issue appears when doing the rolling update (12-SP3 to 12-SP3 with MU updates), and in the 12-SP4 job groups when doing the rolling upgrade (12-SP3 to 12-SP4 with MU updates). Whenever the issue shows up, it is right after updating the 12-SP3 systems but before migrating.

Actions #24

Updated by openqa_review over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_rolling_update_node02
https://openqa.suse.de/tests/6647769

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The label in the openQA scenario is removed
Actions #25

Updated by openqa_review over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_rolling_upgrade_migration_node01
https://openqa.suse.de/tests/6881870

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The label in the openQA scenario is removed
Actions #26

Updated by openqa_review over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_rolling_update_node01
https://openqa.suse.de/tests/6988954

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The label in the openQA scenario is removed
Actions #27

Updated by okurz over 2 years ago

  • Priority changed from Normal to High

There are multiple reminder comments, and the ticket was brought up in chat by vpelcak and discussed with jmichel. I think it is more efficient if qe-sap takes responsibility and ensures that there are no false positives, one way or the other. Other teams achieve this, so qe-sap can as well. "Willing to help" won't cut it, but of course QE Tools is "willing to help" as well ;) If you don't know how to proceed, just unschedule the tests. Keeping false positives is very harmful: it wastes resources and causes alarm fatigue.

Actions #28

Updated by maritawerner over 2 years ago

  • Assignee set to jctmichel
Actions #29

Updated by okurz over 2 years ago

  • Subject changed from [qem][qe-asg][HA] grub2 error: invalid magic number to [qem][qe-sap][ha] grub2 error: invalid magic number
  • Assignee deleted (jctmichel)

Using keyword "qe-sap" as verified by jmichel in weekly QE sync 2021-09-15

Actions #30

Updated by okurz over 2 years ago

  • Assignee set to jctmichel
Actions #31

Updated by jadamek over 2 years ago

Well, the issue should be fixed now.
I was able to reproduce it without the MM config and without the HA modules, so it is related to neither HA nor MM.

There was a sporadic issue when the kernel was installed with the 12-SP3 LTSS updates (maybe it deserves a bug report? I still don't understand why we are the only ones who test migrations in QAM).

The solution was to (a rough command sketch follows the list):

  • Download the qcow2 on my laptop
  • Boot the SLE12SP3 system
  • Register system + LTSS + HA
  • Update
  • Deregister
  • Reboot and make sure it rebooted well
  • Cleanup the system
  • Upload the qcow2 to openqa
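
A rough sketch of the equivalent commands inside the booted SLE12-SP3 VM; the product identifiers and regcodes are placeholders/assumptions, not taken from the actual procedure:

# register base product, LTSS and the HA extension (placeholder regcodes)
SUSEConnect --regcode <BASE_REGCODE>
SUSEConnect -p SLES-LTSS/12.3/x86_64 --regcode <LTSS_REGCODE>
SUSEConnect -p sle-ha/12.3/x86_64 --regcode <HA_REGCODE>
# apply all updates; rerun if zypper itself was updated (exit code 103)
zypper -n patch || zypper -n patch
# deregister, then reboot to verify the updated kernel boots cleanly
SUSEConnect --de-register
reboot
# after cleanup, upload the refreshed qcow2 to openQA as the new HDD_1 asset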

Let's monitor the Test Repo jobgroup until next Monday before closing the ticket.

Actions #32

Updated by jctmichel over 2 years ago

The issue is now fixed and the test is passing.

Actions #33

Updated by jctmichel over 2 years ago

  • Status changed from New to Resolved