action #95458

[qe-sap][ha] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry

Added by acarvajal 2 months ago. Updated about 22 hours ago.

Status:
Workable
Priority:
Normal
Assignee:
Category:
Bugs in existing tests
Target version:
Start date:
2021-07-13
Due date:
2021-09-10
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

openQA tests in HA scenarios (2- or 3-node clusters) fail in different modules due to unexpected reboots
in one or more of the SUTs:

  1. QAM HA qdevice node 1, fails in ha_cluster_init module
  2. QAM HA rolling upgrade migration, node 2, fails in filesystem module
  3. QAM HA hawk/HAProxy node 1, fails in check_after_reboot module
  4. QAM 2 nodes, node 1, fails in ha_cluster_init module

Test suite description

The base test suite is used for job templates defined in YAML documents. It has no settings of its own.

Reproducible

The issue is very sporadic, and reproducing it is not always possible. Usually, re-triggering the jobs leads to the tests passing.

For example, for the jobs listed above, the re-triggered jobs succeeded:

  1. https://openqa.suse.de/tests/6435165
  2. https://openqa.suse.de/tests/6435380
  3. https://openqa.suse.de/tests/6435384
  4. https://openqa.suse.de/tests/6435389

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label ,
call openqa-query-for-job-label poo#95458
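
For convenience, a minimal sketch of that lookup; it assumes curl is available and that the caller has whatever access to the openQA host the script itself requires:

# fetch the helper script from os-autoinst/scripts and query jobs labeled with this ticket
$ curl -s -O https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label
$ chmod +x openqa-query-for-job-label
$ ./openqa-query-for-job-label poo#95458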

Expected result

  1. Last good: :20349:samba (or more recent)
  2. Last good: :20121:crmsh
  3. Last good: 20321:kernel-ec2
  4. Last good: MR:244261:crmsh

Related issues

Related to openQA Tests - action #95788: [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)":retry (New, 2021-07-21)

Related to openQA Tests - action #94171: [qem][sap] test fails in check_logs about 50% of times (Rejected, 2021-06-17)

Related to openQA Tests - action #97013: [qe-core][qe-yast] test fails in handle_reboot, patch_and_reboot, installation (New, 2021-08-17)

History

#1 Updated by maritawerner 2 months ago

  • Subject changed from SUT reboots unexpectedly, leading to tests failing in HA scenarios to [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios

#2 Updated by szarate 2 months ago

  • Project changed from openQA Tests to openQA Project
  • Category changed from Bugs in existing tests to Concrete Bugs

Another example can be: https://openqa.suse.de/tests/6410169#step/filesystem/28

After checking the logs and cross-referencing them with the system journal, the only rough hint I get is:

Jul 11 01:36:26 openqaworker5 kernel: kvm [38069]: vcpu0, guest rIP: 0xffffffff924776b8 disabled perfctr wrmsr: 0xc2 data 0xffff

which corresponds more or less to the last time the test ran one of those commands. I see no coredumps whatsoever, but that message is a bit puzzling (such messages also repeat every now and then on the worker, but other jobs apparently don't have the problem).

PS: Moved to the openQA project for now, although I'm torn between infrastructure and this project itself.

#3 Updated by MDoucha 2 months ago

  • Project changed from openQA Project to openQA Tests
  • Category deleted (Concrete Bugs)

My first guess would be that the test somehow gets into a kernel panic. Add the ignore_level kernel command line parameter to grub.cfg during incident installation to see kernel backtraces in serial console logs. Here's an example of how LTP adds the ignore_level kernel parameter:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/kernel/install_ltp.pm#L353

#4 Updated by MDoucha 2 months ago

  • Project changed from openQA Tests to openQA Project
  • Category set to Concrete Bugs

Oops, sorry for overwriting some metadata.

#5 Updated by okurz 2 months ago

  • Category changed from Concrete Bugs to Support
  • Status changed from New to Feedback
  • Assignee set to okurz
  • Target version set to Ready

Hm, if there were a kernel panic, then the serial log should show at least something. But the system behaves as if it underwent a forced power reset. https://openqa.suse.de/tests/6410169/logfile?filename=serial0.txt shows "Wf8kQ-0-" as the last token before the next command is executed, but there is nothing after that token in the serial log.

I also manually checked the video from https://openqa.suse.de/tests/6410169/file/video.ogv, stepped through the frames one by one, and found nothing between the healthy bash session as in https://openqa.suse.de/tests/6410169#step/filesystem/28 and the grub menu on boot as in https://openqa.suse.de/tests/6410169#step/filesystem/29.

According to https://bugs.centos.org/view.php?id=6730 and https://bugzilla.redhat.com/show_bug.cgi?id=507085 messages like "kvm: vcpu0, guest rIP disabled perfctr wrmsr" are considered harmless. I doubt they are related to the problems we see.

acarvajal I suggest you try the kernel command line parameters that mdoucha suggested. Also you could try if QEMUCPU=host makes any difference, just a hunch :)

EDIT: If you would like others to pick this up, I suggest you come up with "steps to reproduce", e.g. an openqa-cli api -X post isos command line that triggers a safe set of jobs which does not interfere with production, for cross-checking (see the sketch below). Then we could potentially also ask someone else from the tools team to take over.
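
A hedged sketch of such an isos post; all DISTRI/VERSION/FLAVOR/ARCH/TEST values are placeholders to be adapted to the affected scenario, and the dedicated BUILD keeps the jobs clearly separated from production results:

# schedule an isolated set of jobs for this ticket (values are examples only)
$ openqa-cli api --host https://openqa.suse.de -X post isos \
    DISTRI=sle VERSION=15-SP3 FLAVOR=Server-DVD-HA-Incidents ARCH=x86_64 \
    BUILD=poo95458_investigation TEST=qam_ha_rolling_upgrade_migration \
    QEMUCPU=host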

#6 Updated by okurz 2 months ago

  • Due date set to 2021-07-30

#7 Updated by acarvajal 2 months ago

okurz wrote:

acarvajal I suggest you try the kernel command line parameters that mdoucha suggested. Also you could try if QEMUCPU=host makes any difference, just a hunch :)

I have added QEMUCPU=host in the tests that run in the HANA validation openQA instance. Last time there were 3 failures with this issue out of 200+ jobs, so I guess it is a good place to try. I should see some results in the next runs on Monday.

BTW, interesting hunch. I am not seeing this issue on Power9 (100+ jobs in that same openQA instance), which made me think that whatever's causing it could be related to qemu. I'll come back with some results next Monday.

I will also begin planning to introduce mdoucha's suggestions to gather more logs for the tests in osd.

#8 Updated by okurz 2 months ago

acarvajal wrote:

okurz wrote:

acarvajal I suggest you try the kernel command line parameters that mdoucha suggested. Also you could try if QEMUCPU=host makes any difference, just a hunch :)

I have added QEMUCPU=host in the tests that run in the HANA validation openQA instance. Last time there were 3 failures with this issue out of 200+ jobs, so I guess it is a good place to try. I should see some results in the next runs on Monday.

What I had in mind was not to change the production tests, but rather to trigger an additional, dedicated test set, e.g. following https://progress.opensuse.org/projects/openqatests/wiki/Wiki#Statistical-investigation (see the sketch below), and then after the weekend look at the corresponding test overview page to quickly see how many tests passed or failed.
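
One possible sketch of such a dedicated run using openqa-clone-job; the job id, BUILD and TEST suffix below are examples only, and parallel cluster jobs may need additional handling as noted later in this ticket:

# clone the failing job 20 times into a dedicated build with QEMUCPU=host;
# repeat without QEMUCPU=host for a comparison set
$ for i in $(seq 1 20); do
      openqa-clone-job --within-instance https://openqa.suse.de 6410169 \
          BUILD=poo95458_statistics TEST=qam_ha_filesystem_poo95458 QEMUCPU=host
  done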

#9 Updated by acarvajal 2 months ago

I had to restart 2 tests with the failure yesterday in the HANA validation openQA instance.

So the difference between using QEMUCPU=host and not using it was 3 failures out of 200+ jobs last week versus 2 out of 200+ this week. I don't think that is statistically significant, and the bad news is that the issue is still present.

I would look into implementing mdoucha's suggestions and triggering some additional jobs with and without QEMUCPU=host to do a more thorough analysis.

#10 Updated by okurz 2 months ago

  • Subject changed from [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios to [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha/filesystem.pm:81.*command.*mkfs -t ocfs2.*timed out":retry

I have seen the symptom

[2021-07-11T01:38:13.388 CEST] [info] ::: basetest::runtest: # Test died: command 'mkfs -t ocfs2 -F -N 16 "/dev/vg_cluster_md/lv_openqa"' timed out at /usr/lib/os-autoinst/testapi.pm line 959.

likely more than once – or I keep getting back to the same job ;) – still, I am trying with auto-review to automatically detect such cases and retrigger the corresponding tests. The regex should be specific enough to only catch HA tests, but it can be generalized or extended at any time to cover alternative symptoms. Quoting acarvajal: "hope you can find an easy way to detect those cases .... it's not easy as it can happen anytime and in any module during the test" – but we should start somewhere :)

#11 Updated by acarvajal 2 months ago

okurz wrote:

I have seen the symptom

[2021-07-11T01:38:13.388 CEST] [info] ::: basetest::runtest: # Test died: command 'mkfs -t ocfs2 -F -N 16 "/dev/vg_cluster_md/lv_openqa"' timed out at /usr/lib/os-autoinst/testapi.pm line 959.

likely more than once – or I keep getting back to the same job ;) – still, I am trying with auto-review to automatically detect such cases and retrigger the corresponding tests. The regex should be specific enough to only catch HA tests, but it can be generalized or extended at any time to cover alternative symptoms. Quoting acarvajal: "hope you can find an easy way to detect those cases .... it's not easy as it can happen anytime and in any module during the test" – but we should start somewhere :)

EDIT: Tested with "echo https://openqa.suse.de/tests/6410236 | env host=openqa.suse.de openqa-investigate", using openqa-investigate from github.com/os-autoinst/scripts/ . But openqa-investigate currently does not support cloning jobs that are part of a multi-machine cluster, see https://github.com/os-autoinst/scripts/commit/371467dafcefb9182530c790c33632f8cfa9a297#diff-f73cf39a07f6cf8cdb453862496919d06df16d07e58b274e68ea148dd1f7dae5

That's one of the symptoms.

I'd say whenever SUT is unexpectedly rebooted, tests will fail in one of two ways depending on what the test module was doing:

  1. A timeout in an assert_script_run (or similar), such as the symptom in this filesystem test module failure.
  2. A failure in assert_screen such as in https://openqa.suse.de/tests/6426340#step/check_after_reboot/5

Since the majority of the HA test modules rely on either the root_console or the serial terminal, I think the first case will be more common, but I don't know whether a general rule to restart tests when commands time out is safe.

#12 Updated by okurz 2 months ago

  • Subject changed from [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha/filesystem.pm:81.*command.*mkfs -t ocfs2.*timed out":retry to [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*command.*mkfs -t ocfs2.*timed out.*match=root-console timed out":retry

#13 Updated by okurz 2 months ago

Tested the auto-review regex with

$ echo https://openqa.suse.de/tests/6410225 | env dry_run=1 host=openqa.suse.de ./openqa-label-known-issues
openqa-cli api --host https://openqa.suse.de -X POST jobs/6410225/comments text=poo#95458 [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*command.*mkfs -t ocfs2.*timed out.*match=root-console timed out":retry
openqa-cli api --host https://openqa.suse.de -X POST jobs/6410225/restart

So a comment would have been written and the test would have been restarted, assuming this works the same way over the API for multi-machine clusters.
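
For a quick local cross-check of the regex itself, assuming (as a simplification) that matching the job's autoinst-log.txt is representative of what the auto-review tooling evaluates:

# download the log of the affected job and run the current subject regex against it
$ curl -s https://openqa.suse.de/tests/6410225/file/autoinst-log.txt -o autoinst-log.txt
$ perl -0777 -ne 'print "regex matches\n" if m{(?s)tests/ha.*command.*mkfs -t ocfs2.*timed out.*match=root-console timed out}' autoinst-log.txt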

#14 Updated by okurz 2 months ago

  • Subject changed from [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*command.*mkfs -t ocfs2.*timed out.*match=root-console timed out":retry to [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry

#15 Updated by okurz 2 months ago

  • Related to action #95788: [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)":retry added

#16 Updated by okurz 2 months ago

  • Description updated (diff)

#17 Updated by okurz about 2 months ago

  • Project changed from openQA Project to openQA Tests
  • Subject changed from [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry to [ha] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry
  • Category changed from Support to Bugs in existing tests
  • Status changed from Feedback to Workable
  • Assignee changed from okurz to acarvajal

The auto-review regex matching might be a bit too broad, as it also catches issues like https://openqa.suse.de/tests/6581119#step/iscsi_client/18 where the test fails in iscsi_client and then the post_fail_hook also fails to select a free root terminal. However, this is all within the scope of tests/ha, so I will leave this to you again. As a follow-up to #95458#note-2: sorry, I don't see how this is a problem with openQA itself.

#18 Updated by MDoucha about 2 months ago

MDoucha wrote:

My first guess would be that the test somehow gets into a kernel panic. Add the ignore_level kernel command line parameter to grub.cfg during incident installation to see kernel backtraces in serial console logs. Here's an example of how LTP adds the ignore_level kernel parameter:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/kernel/install_ltp.pm#L353

Correction of my suggestion: the kernel command line parameter is actually ignore_loglevel. Also updating the link above to a permalink:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/49342da7528b6bc0a8b418090487bc40c7f8e4ce/tests/kernel/install_ltp.pm#L359
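
For reference, a minimal manual sketch of enabling the parameter on a SUT, assuming SUSE-style grub paths; the linked install_ltp.pm code achieves the same from within the test:

# append ignore_loglevel to the default kernel command line
$ sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&ignore_loglevel /' /etc/default/grub
# regenerate grub.cfg and reboot so the next boot uses the parameter
$ grub2-mkconfig -o /boot/grub2/grub.cfg
$ reboot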

#19 Updated by okurz about 2 months ago

  • Due date changed from 2021-07-30 to 2021-08-06
  • Target version changed from Ready to future

acarvajal are you ok to continue here?

#20 Updated by okurz about 2 months ago

  • Related to action #94171: [qem][sap] test fails in check_logs about 50% of times added

#21 Updated by acarvajal about 1 month ago

  • Project changed from openQA Tests to openQA Project
  • Due date changed from 2021-08-06 to 2021-09-10
  • Category deleted (Bugs in existing tests)
  • Status changed from Workable to Feedback
  • Assignee changed from acarvajal to okurz
  • Target version changed from future to Ready

okurz wrote:

acarvajal are you ok to continue here?

Yes. I think so. I will probably sync with you before doing so though.

#22 Updated by acarvajal about 1 month ago

  • Project changed from openQA Project to openQA Tests
  • Category set to Bugs in existing tests
  • Status changed from Feedback to Workable
  • Assignee changed from okurz to acarvajal
  • Target version changed from Ready to future

#23 Updated by szarate about 1 month ago

  • Related to action #97013: [qe-core][qe-yast] test fails in handle_reboot, patch_and_reboot, installation added

#24 Updated by openqa_review 26 days ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: migration_offline_scc_verify_sle15sp1_ltss_ha_alpha_node02
https://openqa.suse.de/tests/6958789

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The label in the openQA scenario is removed

#25 Updated by okurz 8 days ago

  • Subject changed from [ha] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry to [qe-sap][ha] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry

Using keyword "qe-sap" as verified by jmichel in weekly QE sync 2021-09-15

#26 Updated by okurz about 22 hours ago

This ticket is exceeding its due date. It popped up during the weekly QE sync 2021-09-22. We would appreciate a reaction within the next few days, at least updating the due date according to what we can realistically expect. See https://progress.opensuse.org/projects/openqatests/wiki#SLOs-service-level-objectives for details.
