Project

General

Profile

Actions

action #54764

closed

[ha][migration] migration_verify_sle11sp4_ha_alpha_node02 fails in check_after_reboot

Added by acarvajal over 4 years ago. Updated over 4 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Start date:
2019-07-29
Due date:
% Done:

100%

Estimated time:
Difficulty:

Description

Starting around 08.07.2019 and 10.07.2019, HA migration tests from 11-SP4 to 12-SP5 (tests migration_verify_sle11sp4_ha_alpha_node01 and migration_verify_sle11sp4_ha_alpha_node02) are failing due to VMs restart after cluster is started on the cluster nodes.

Due to the step where the test is failing, it has not been possible to gather logs so far. Still working on a way to accomplish this.

Jobs previous to that date were successfully completed:

Node1 Upgrade: https://openqa.suse.de/tests/3041853
Node2 Upgrade: https://openqa.suse.de/tests/3041206
Node1: https://openqa.suse.de/tests/3041856
Node2: https://openqa.suse.de/tests/3041855
Support Server: https://openqa.suse.de/tests/3041854

However, after that date, the Node Upgrade tests still work:

Node1 Upgrade: https://openqa.suse.de/tests/3161942
Node2 Upgrade: https://openqa.suse.de/tests/3161939

While the cluster tests fail with the problem described above:

Node1: https://openqa.suse.de/tests/3161945
Node2: https://openqa.suse.de/tests/3161944
Support Server: https://openqa.suse.de/tests/3161943

(In the video for Node1 it can also be seen that this node is also restarting)

This issue is only affecting the 11-SP4 to 12-SP5 HA migration scenario. All other migration scenarios (12-SP2-LTSS, 12-SP3-LTSS, 12-SP4, etc.) are working.

Do not think this is related to the build itself, as the same test in our development environment is working using the qcow2 images generated by openqa.suse.de's upgrade tests:

Node1: http://mango.suse.de/tests/1169
Node2: http://mango.suse.de/tests/1170
Support Server: http://mango.suse.de/tests/1168

Also performed manual testing, and was not able to reproduce the issue.

Think this could be related to the upgrade of the x86_64 workers to Leap 15.1 as the workers in our development environment are using SLES 12-SP3.

Could not find the exact date of the migration to Leap 15.1 in: https://confluence.suse.com/pages/viewpage.action?pageId=194052156, but think it was on 09.07.2019.

Actions #1

Updated by okurz over 4 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: migration_verify_sle11sp4_ha_alpha_node02
https://openqa.suse.de/tests/3245942

Actions #2

Updated by okurz over 4 years ago

  • Project changed from openQA Infrastructure to openQA Tests
  • Subject changed from migration_verify_sle11sp4_ha_alpha_node02 fails in check_after_reboot to [ha][migration] migration_verify_sle11sp4_ha_alpha_node02 fails in check_after_reboot
  • Category set to Bugs in existing tests

are you sure this infrastructure related? https://openqa.suse.de/tests/3266433#step/check_after_reboot/3 shows simply that the test expects the root console but the grub screen expects a key press. Didn't you or ldevulder recently work on changes regarding grub menu handling? Btw, do you still consider yourself parts of the QA tools team? Haven't seen you there since I joined and you are still mentioned on the wiki page for the team :)

Actions #3

Updated by acarvajal over 4 years ago

okurz wrote:

are you sure this infrastructure related? https://openqa.suse.de/tests/3266433#step/check_after_reboot/3 shows simply that the test expects the root console but the grub screen expects a key press.

I am sure the cluster nodes weren't restarting in that step when osd workers were at 42.3, and I'm sure that the same test with the same qcow2 images (generated by the upgrade tests in osd) is working on a current version of openqa running on SLES 12-SP3 workers:

http://mango.suse.de/tests/1340
http://mango.suse.de/tests/1341
http://mango.suse.de/tests/1339

I did try modifying the test with a '$self->wait_boot()' for that scenario, but even when the test successfully handles that first reboot, the nodes reboot again.

My current guess (and it's just a guess), is that this could be related to the qemu version.

As to your specific question: no, I'm not sure this is purely infrastructure related. I'm not seeing the same issue in other migration scenarios (this only happens in the 11sp4 to 12sp5 migration scenario), and even then, only after the cluster services are started. But as said above, this is not happening with openqa workers in SLES 12-SP3 and (more importantly), couldn't reproduce the issue manually.

Didn't you or ldevulder recently work on changes regarding grub menu handling?

Only for IPMI. This runs only on QEMU.

Btw, do you still consider yourself parts of the QA tools team? Haven't seen you there since I joined and you are still mentioned on the wiki page for the team :)

Yes, per my goals, 20% of my time is dedicated to QA tools team tasks. I've been keeping an eye in #qa-tools in IRC and in #testing in RC, and checking the backlog, but if there's a different QA tools team channel in RC I wasn't invited. This is precisely the reason I tried to help you and coolo earlier this week when you dropped in #qa-tools in IRC, I finally saw activity there.

So, let me know how/if I can help.

Actions #4

Updated by okurz over 4 years ago

Without trying to tell you what do then I suggest you are the best candidate to evaluate further what in detail could have caused the problem, e.g. qemu as you mentioned :)

Actions #5

Updated by acarvajal over 4 years ago

  • Assignee set to acarvajal
Actions #6

Updated by okurz over 4 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: migration_verify_sle11sp4_ha_alpha_node02
https://openqa.suse.de/tests/3359639

Actions #7

Updated by okurz over 4 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: migration_verify_sle11sp4_ha_alpha_node02
https://openqa.suse.de/tests/3421046

Actions #8

Updated by acarvajal over 4 years ago

Seeing as this is not leaving any type of usable log with something pointing to an issue, decided to re-trigger the failing tests and take control of both nodes in the ha/check_after_reboot module.

After handling the boot manually and waiting for the cluster to start, waited for some minutes and didn't experience any reboots.

With this in mind, experimented scheduling a boot/boot_to_desktop between ha/upgrade_from_sle11sp4_workarounds and ha/check_after_reboot. This helped in the sense that boot/boot_to_desktop handled the reboot, but then there was a second reboot at the start of ha/check_after_reboot.

Was able to get the tests passing on build 0152@0346 of 12-SP5 (https://openqa.suse.de/tests/3452168 & https://openqa.suse.de/tests/3452167) by taking manual control of both nodes on ha/check_after_reboot and on ha/check_hawk and handling the reboots at the start of each module manually. I guess this means that it may be possible to do the same by adding a $self->wait_boot at the start of those 2 modules; however this doesn't seem like a good idea as the nodes should not be restarting on that step and having code there to handle unexpected reboots could mask future issues. Also, the same test is working without any unexpected reboots in the lab (http://mango.suse.de/tests/1725 & http://mango.suse.de/tests/1726) on workers on SLES 12-SP3 instead of Leap 15.1.

Will continue investigating this by having a look at autotest searching for clues as to what may be causing the reboot in this scenario.

Actions #9

Updated by acarvajal over 4 years ago

  • Project changed from openQA Tests to openQA Infrastructure
  • Category deleted (Bugs in existing tests)
  • Status changed from New to Closed
  • Assignee deleted (acarvajal)
  • % Done changed from 0 to 100

Was able to narrow down the VM reboot to the make_snapshot() call in os-autoinst::autotest, which is called between test modules execution when snapshot are available.

Tested successfully with QEMU_DISABLE_SNAPSHOT=1 in openqa.suse.de: https://openqa.suse.de/tests/3466753 & https://openqa.suse.de/tests/3466752

This still does not explain why the test was working with the snapshots enabled when the workers were in Leap 42.3. I figure a combination of the qcow2 images in use (which were generated manually) and the version of either os-autoinst or qemu could be the cause of the issue.

Since this is not a widespread issue, and there is a workaround for the tests, I will be closing this.

Actions #10

Updated by acarvajal over 4 years ago

  • Project changed from openQA Infrastructure to openQA Tests
Actions

Also available in: Atom PDF