action #88273

closed

[qem][qe-sap][ha] grub2 error: invalid magic number

Added by dzedro about 3 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Bugs in existing tests
Target version:
-
Start date:
2021-01-27
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

openQA test in scenario sle-12-SP3-Server-DVD-HA-Updates-x86_64-qam_ha_rolling_update_node01@64bit fails in
console_reboot

Reproducible

Fails since (at least) Build 20210126-1 (current job)

Expected result

Last good: 20210125-2 (or more recent)

Further details

Always latest result in this scenario: latest

Actions #1

Updated by okurz about 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_rolling_update_node02
https://openqa.suse.de/tests/5483858

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #2

Updated by openqa_review about 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_rolling_update_node02
https://openqa.suse.de/tests/5649219

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #3

Updated by okurz about 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_rolling_update_node01
https://openqa.suse.de/tests/5741396

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #4

Updated by okurz about 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_rolling_upgrade_migration_node02
https://openqa.suse.de/tests/5818158

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #5

Updated by okurz almost 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_rolling_upgrade_migration_node02
https://openqa.suse.de/tests/5909435

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #6

Updated by osukup almost 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_rolling_update_node01
https://openqa.suse.de/tests/6002065

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #7

Updated by okurz almost 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_rolling_update_node02
https://openqa.suse.de/tests/6141214

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #8

Updated by okurz almost 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_rolling_update_node01
https://openqa.suse.de/tests/6230455

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #9

Updated by acarvajal almost 3 years ago

This test runs in an MM setup, with one job for node 1 of an HA cluster and another job for node 2.

Both nodes boot from the same qcow2 image (HDD_1=sle-12-SP3-x86_64-ha-rolling_upgrade_update.qcow2), register the system with SCC, run a zypper patch command and then reboot before configuring the HA cluster later on.

However, as can be seen in the results from build 20210618-1, node 2 is unable to boot after the zypper patch, while node 1 is:

node 1: https://openqa.suse.de/tests/6285308#step/console_reboot/4
node 2: https://openqa.suse.de/tests/6285307#step/console_reboot/4

Comparing the settings, I did not find anything that would explain why two tests which, up to that step, should have behaved identically end up with different results.

I do not even think any QAM-specific setup is involved by the time the tests fail, as at that point this is simply "boot into 12-SP3 -> register -> patch & reboot". Even comparing both autoinst logs does not show anything that would explain the difference.

Looking around, I see that sometimes it is node 1 that is unable to boot and sometimes node 2, as in the examples above.

Since both nodes start from the same conditions (same qcow2 image), run the same modules, register the same extensions, etc., I can only guess that there is a race condition causing one of the nodes to fail to boot after patching while the other works.
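
For reference, a minimal sketch of what each node effectively runs up to the failing step, assuming SCC registration is done with SUSEConnect (the regcode is a placeholder); this is an illustration, not the actual test code:

# register the freshly booted 12-SP3 system against SCC (placeholder regcode)
SUSEConnect --regcode <REGCODE>
# apply all available maintenance updates; rerun once if zypper itself was updated (exit code 103)
zypper -n patch || zypper -n patch
# reboot into the possibly new kernel -- this is where grub2 reports "invalid magic number"
reboot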

Actions #10

Updated by acarvajal almost 3 years ago

As can be seen at https://openqa.suse.de/tests/6285307#next_previous, the test finished successfully in the run with build 20210620-1, while it failed with 20210621-1.

Test settings are of course the same save for the build.

Passing tests ran on openqaworker10:5, openqaworker5:6 & openqaworker10:8

Failing tests ran on openqaworker10:4, openqaworker10:7 & openqaworker5:34

So it does not seem to be due to a difference between the workers used.

I will clone the failing test into our development environment and run it several times to try to reproduce the issue, but I suspect a performance-related race condition is affecting these results.

Actions #11

Updated by acarvajal almost 3 years ago

Cloned these jobs into our development environment with:

openqa-clone-job --host localhost --clone-children https://openqa.suse.de/tests/6298499 WORKER_CLASS=tap,qemu_x86_64

All tests passed in 5 consecutive runs:

1) http://mango.qa.suse.de/tests/4107 & http://mango.qa.suse.de/tests/4108
2) http://mango.qa.suse.de/tests/4110 & http://mango.qa.suse.de/tests/4111
3) http://mango.qa.suse.de/tests/4113 & http://mango.qa.suse.de/tests/4114
4) http://mango.qa.suse.de/tests/4116 & http://mango.qa.suse.de/tests/4117
5) http://mango.qa.suse.de/tests/4119 & http://mango.qa.suse.de/tests/4120
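
Roughly how such consecutive runs can be kicked off, simply repeating the exact clone command used above (host, job URL and worker class unchanged); only a convenience sketch:

# clone the failing pair several times in a row to look for the sporadic failure
for i in $(seq 1 5); do
    openqa-clone-job --host localhost --clone-children \
        https://openqa.suse.de/tests/6298499 WORKER_CLASS=tap,qemu_x86_64
done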

The test code is current for the modules in use when compared with osd:

root@mango:/var/lib/openqa/tests/sle [master|…1] > git log --oneline | head -5
203e34915 Merge pull request #12751 from jlausuch/containers_add_ubi_images
289af6565 Use --force-share to perform operations on image
1006cb0e2 Containers: add Ubi6 and Ubi7 images to the 3rd party image tests
0f6e5c483 lib/utils.pm: Support two stages of zypper patch returning 103
22251967e Merge pull request #12747 from pdostal/tw_ssh

vs.

openqa:/var/lib/openqa/tests/sle # git log --oneline | head -5
d9c826ef4 Merge pull request #12762 from grisu48/run_ltp
e5d7c3c6c Merge pull request #12753 from jlausuch/basic_k3s_test
203e34915 Merge pull request #12751 from jlausuch/containers_add_ubi_images
289af6565 Use --force-share to perform operations on image
d6f22108a Remove required SCC_REGCODE variable for OnDemand

The webUI is at openQA-4.6.1624280432.2ce59c621-lp151.4088.1.noarch in the development environment vs. openQA-4.6.1624015887.f215823e1-lp152.4085.1.noarch in osd, and openQA-worker is at openQA-worker-4.6.1624280432.2ce59c621-lp152.4087.1.noarch in the development environment vs. openQA-worker-4.6.1624015887.f215823e1-lp152.4085.1.noarch on openqaworker10.

However, the test fails frequently in osd: https://openqa.suse.de/tests/6285307#next_previous

Could we attempt to move these tests out of openqaworker10 into another worker to see if they fare better?

Any hint at what may be causing this? It is frequent enough to look like a real issue, but it is clearly not caused by the test code.

Actions #12

Updated by okurz almost 3 years ago

For reference what I answered in chat:

Actions #13

Updated by acarvajal almost 3 years ago

okurz wrote:

For reference what I answered in chat:

Thanks.

  • What is your current hypothesis? And what experiment did you design to check it?

No current hypothesis. I'm currently blocked with this. The only working detail I have is that the test works consistently without issues in our development environment and fails sporadically on openqaworker10 before even attempting any HA configuration (i.e., a simple boot + zypper up + reboot triggers the issue).

As discussed in the chat, I will move this to another worker and monitor, and will write & schedule a test module that checks the integrity of the qcow2 image when the tests start, even though I'm certain the problem is not caused by differences in the qcow2 image, as it is the same file path in both jobs and both jobs run on the same worker.

Whatever is breaking the test seems to come in via an update installed by zypper up, which should also be the same for both jobs. If I had to guess, I'd say the updated kernel file gets corrupted when it is written in one of the VMs and not in the other.

  • All worker machines within o3 and osd have their hostname as an additional worker class. This is meant to be solely used for investigation. So you can schedule tests on a specific machine, e.g. with WORKER_CLASS=openqaworker9

Yes, I will try this; a rough example of such a pinned clone follows below.
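
A rough example of combining the hostname-based worker class with the clone command from comment #11; whether the tap class has to stay in the list depends on the MM setup, so treat this only as a sketch:

# pin the cloned jobs to a single machine for investigation
# (for MM jobs the tap class may need to be kept, e.g. WORKER_CLASS=tap,openqaworker9)
openqa-clone-job --host localhost --clone-children \
    https://openqa.suse.de/tests/6298499 WORKER_CLASS=openqaworker9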

Actions #14

Updated by acarvajal almost 3 years ago

I moved the rolling update tests to openqaworker9 via the qam_ha_rolling_update_node01, qam_ha_rolling_update_node02 & qam_ha_rolling_update_support_server test suites.

Let's see if there's a different outcome in this worker.

Actions #15

Updated by okurz almost 3 years ago

acarvajal wrote:

okurz wrote:

For reference what I answered in chat:

Thanks.

  • What is your current hypothesis? And what experiment did you design to check it?

No current hypothesis. I'm currently blocked with this. The only working detail I have is that the test works consistently without issues in our development environment and fails sporadically on openqaworker10 before even attempting any HA configuration (i.e., a simple boot + zypper up + reboot triggers the issue).

As discussed in the chat, I will move this to another worker and monitor, and will write & schedule a test module that checks the integrity of the qcow2 image when the tests start, even though I'm certain the problem is not caused by differences in the qcow2 image, as it is the same file path in both jobs and both jobs run on the same worker.

So that means you already have two hypotheses and experiments in mind, even if you did not call them that:

  • H1 The worker machine makes a difference -> E1-1 pin jobs to a special worker and check if the fail ratio significantly differs
  • H2 The root storage content is corrupted when loading within openQA tests -> E2-1 schedule a test module to check the integrity of the qcow2 image when the tests are starting
Actions #16

Updated by acarvajal almost 3 years ago

First run on openqaworker9 failed:

Node1: https://openqa.suse.de/tests/6323075#step/console_reboot/4 (failing)
Node2: https://openqa.suse.de/tests/6323076 (not failing)

While the last runs on openqaworker10 worked:

Node1: https://openqa.suse.de/tests/6312914
Node2: https://openqa.suse.de/tests/6312913

I will leave this as is for a couple of days to gather more results, but the initial ones are not encouraging.

Actions #17

Updated by acarvajal almost 3 years ago

okurz wrote:

  • H2 The root storage content is corrupted when loading within openQA tests -> E2-1 schedule a test module to check the integrity of the qcow2 image when the tests are starting

Based on my last message, I will have to assign a higher priority to H2. However, I'm certain the root storage content itself is not corrupted; I think whatever corruption there is comes in after zypper up.

But let's see what the results of E2-1 are before anything else.

Actions #18

Updated by acarvajal almost 3 years ago

I am submitting https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/12791 to record checksums/digests of the HDD_# assets and of the kernel files, to get a better hint of what may be happening in osd with these tests.

@okurz @grifalconi @dzedro could you help me with a review? Thanks in advance.
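
The kind of information being recorded roughly corresponds to commands like the following; the pool path and the split between worker and SUT are assumptions for illustration only, since the PR itself implements this as openQA test modules:

# on the worker: digest of the backing image and of the per-job overlay
sha256sum /var/lib/openqa/pool/<N>/sle-12-SP3-x86_64-ha-rolling_upgrade_update.qcow2
sha256sum /var/lib/openqa/pool/<N>/raid/hd0-overlay0

# inside the SUT, before and after 'zypper patch': digests of the boot files grub2 loads
sha256sum /boot/vmlinuz-* /boot/initrd-*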

Actions #19

Updated by acarvajal almost 3 years ago

We have some results from the investigation, and they are inconclusive.

In the passing tests in the development environment, we can see that the digest of the qcow2 image is the same on both nodes right after setup:

Node 1: http://mango.qa.suse.de/tests/4143#step/show_hdd_info/1
Node 2: http://mango.qa.suse.de/tests/4144#step/show_hdd_info/1

hd0_overlay0's digest is different, but this is expected.

Digests remain unchanged after zypper up:

Node 1: http://mango.qa.suse.de/tests/4143#step/show_hdd_info#1/1
Node 2: http://mango.qa.suse.de/tests/4144#step/show_hdd_info#1/1

And the kernel digest is the same on both nodes:

Node 1: http://mango.qa.suse.de/tests/4143#step/check_boot_files/4
Node 2: http://mango.qa.suse.de/tests/4144#step/check_boot_files/4

After this, both nodes successfully reboot and test passes.

In a failing test on openqaworker9, it can also be seen that the digests of the disks and of the kernel are the same between the two nodes:

Node 1: https://openqa.suse.de/tests/6345325#step/show_hdd_info/1
Node 2: https://openqa.suse.de/tests/6345326#step/show_hdd_info/1

Same after zypper up:

Node 1: https://openqa.suse.de/tests/6345325#step/show_hdd_info#1/1
Node 2: https://openqa.suse.de/tests/6345326#step/show_hdd_info#1/1

And kernel after zypper up:

Node 1: https://openqa.suse.de/tests/6345325#step/check_boot_files/4
Node 2: https://openqa.suse.de/tests/6345326#step/check_boot_files/4

However, node 1 failed to boot with the new kernel:

https://openqa.suse.de/tests/6345325#step/console_reboot/4

While node 2 successfully booted:

https://openqa.suse.de/tests/6345326#step/console_reboot/4

So at least it seems the cause is not any disk or file corruption.

Actions #20

Updated by dzedro almost 3 years ago

Why should sle-12-SP3-x86_64-ha-rolling_upgrade_update.qcow2 change ... ?

[2021-06-28T15:26:00.645 CEST] [debug] running /usr/bin/qemu-img create -f qcow2 -b /var/lib/openqa/pool/16/sle-12-SP3-x86_64-ha-rolling_upgrade_update.qcow2 /var/lib/openqa/pool/16/raid/hd0-overlay0 16106127360
[2021-06-28T15:26:00.659 CEST] [debug] Formatting '/var/lib/openqa/pool/16/raid/hd0-overlay0', fmt=qcow2 size=16106127360 backing_file=/var/lib/openqa/pool/16/sle-12-SP3-x86_64-ha-rolling_upgrade_update.qcow2 
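
For reference, the overlay chain can be inspected directly on the worker with qemu-img to confirm that the backing file is untouched; -U/--force-share is needed while a VM still has the image open (illustration only, paths as in the log above):

# show which backing file the overlay points at
qemu-img info -U /var/lib/openqa/pool/16/raid/hd0-overlay0
# check both images for internal corruption
qemu-img check -U /var/lib/openqa/pool/16/raid/hd0-overlay0
qemu-img check -U /var/lib/openqa/pool/16/sle-12-SP3-x86_64-ha-rolling_upgrade_update.qcow2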
Actions #21

Updated by openqa_review almost 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_rolling_update_node01
https://openqa.suse.de/tests/6435426

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The label in the openQA scenario is removed
Actions #22

Updated by acarvajal over 2 years ago

The issue is also present in the 12-SP4 rolling upgrade scenario:

https://openqa.suse.de/tests/6489800
https://openqa.suse.de/tests/6490880#step/console_reboot/4

Also booting from the same 12-SP3 qcow2 image.

So in the 12-SP3 job groups the issue appears when doing the rolling update (12-SP3 to 12-SP3 with MU updates), and in the 12-SP4 job groups when doing the rolling upgrade (12-SP3 to 12-SP4 with MU updates). Whenever the issue shows up, it is right after updating the 12-SP3 systems but before migrating.

Actions #24

Updated by openqa_review over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_rolling_update_node02
https://openqa.suse.de/tests/6647769

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The label in the openQA scenario is removed
Actions #25

Updated by openqa_review over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_rolling_upgrade_migration_node01
https://openqa.suse.de/tests/6881870

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The label in the openQA scenario is removed
Actions #26

Updated by openqa_review over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_rolling_update_node01
https://openqa.suse.de/tests/6988954

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The label in the openQA scenario is removed
Actions #27

Updated by okurz over 2 years ago

  • Priority changed from Normal to High

There are multiple reminder comments, and the ticket was brought up in chat by vpelcak and discussed with jmichel. I think it is more efficient if qe-sap takes responsibility and ensures that there are no false positives, one way or the other. Other teams achieve this, so qe-sap can as well. "Willing to help" won't cut it, but of course QE Tools is "willing to help" as well ;) If you don't know how to proceed, just unschedule the tests. Keeping false positives is very harmful: it wastes resources and causes alarm fatigue.

Actions #28

Updated by maritawerner over 2 years ago

  • Assignee set to jctmichel
Actions #29

Updated by okurz over 2 years ago

  • Subject changed from [qem][qe-asg][HA] grub2 error: invalid magic number to [qem][qe-sap][ha] grub2 error: invalid magic number
  • Assignee deleted (jctmichel)

Using keyword "qe-sap" as verified by jmichel in weekly QE sync 2021-09-15

Actions #30

Updated by okurz over 2 years ago

  • Assignee set to jctmichel
Actions #31

Updated by jadamek over 2 years ago

Well, the issue should be fixed now.
I was able to reproduce it without the MM config and without the HA modules, so it is related to neither HA nor MM.

There was a sporadic issue when the kernel was installed with the 12-SP3 LTSS updates (maybe it deserves a bug report? I still don't understand why we are the only ones who test migrations in QAM).

The solution was to (a rough command sketch follows the list):

  • Download the qcow2 on my laptop
  • Boot the SLE12SP3 system
  • Register system + LTSS + HA
  • Update
  • Deregister
  • Reboot and make sure it rebooted well
  • Cleanup the system
  • Upload the qcow2 to openqa
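
A rough sketch of the equivalent commands inside the booted SLE12-SP3 VM; the product identifiers and regcodes are placeholders/assumptions, not taken from the actual procedure:

# register base product, LTSS and the HA extension (placeholder regcodes)
SUSEConnect --regcode <BASE_REGCODE>
SUSEConnect -p SLES-LTSS/12.3/x86_64 --regcode <LTSS_REGCODE>
SUSEConnect -p sle-ha/12.3/x86_64 --regcode <HA_REGCODE>
# apply all updates; rerun if zypper itself was updated (exit code 103)
zypper -n patch || zypper -n patch
# deregister, then reboot to verify the updated kernel boots cleanly
SUSEConnect --de-register
reboot
# after cleanup, upload the refreshed qcow2 to openQA as the new HDD_1 asset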

Let's monitor the Test Repo jobgroup until next Monday before closing the ticket.

Actions #32

Updated by jctmichel over 2 years ago

The issue is now fixed and the test is passing.

Actions #33

Updated by jctmichel over 2 years ago

  • Status changed from New to Resolved