action #95458 (open)

[qe-sap][ha] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry

Added by acarvajal almost 3 years ago. Updated 5 months ago.

Status: Feedback
Priority: Normal
Assignee:
Category: Infrastructure
Target version:
Start date: 2021-07-13
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

openQA tests in HA scenarios (2- or 3-node clusters) fail in different modules due to unexpected reboots in one or more of the SUTs:

  1. QAM HA qdevice node 1, fails in ha_cluster_init module
  2. QAM HA rolling upgrade migration, node 2, fails in filesystem module
  3. QAM HA hawk/HAProxy node 1, fails in check_after_reboot module
  4. QAM 2 nodes, node 1, fails in ha_cluster_init module

Test suite description

The base test suite is used for job templates defined in YAML documents. It has no settings of its own.

Reproducible

The issue is very sporadic, and reproducing it is not always possible. Usually, re-triggering the jobs leads to the tests passing.

For example, from the jobs above, re-triggered jobs succeeded:

  1. https://openqa.suse.de/tests/6435165
  2. https://openqa.suse.de/tests/6435380
  3. https://openqa.suse.de/tests/6435384
  4. https://openqa.suse.de/tests/6435389

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label ,
call openqa-query-for-job-label poo#95458
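
For reference, running that query from a shell could look roughly like this sketch (the script only needs to be downloaded and made executable; adjust to your environment as needed):

  # fetch the helper script from os-autoinst/scripts and make it executable
  curl -s -O https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label
  chmod +x openqa-query-for-job-label
  # list recent jobs labeled with this ticket
  ./openqa-query-for-job-label poo#95458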

Expected result

  1. Last good: :20349:samba (or more recent)
  2. Last good: :20121:crmsh
  3. Last good: 20321:kernel-ec2
  4. Last good: MR:244261:crmsh

Related issues (1 open, 2 closed)

Related to openQA Tests - action #95788: [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network (Feedback, 2021-07-21)

Related to openQA Tests - action #94171: [qem][sap] test fails in check_logs about 50% of times (Rejected, 2021-06-17)

Related to openQA Tests - action #97013: [qe-core][qe-yast] test fails in handle_reboot, patch_and_reboot, installation (Resolved, 2021-08-17)

Actions #1

Updated by maritawerner almost 3 years ago

  • Subject changed from SUT reboots unexpectedly, leading to tests failing in HA scenarios to [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios
Actions #2

Updated by szarate almost 3 years ago

  • Project changed from openQA Tests to openQA Project
  • Category changed from Bugs in existing tests to Regressions/Crashes

Another example can be: https://openqa.suse.de/tests/6410169#step/filesystem/28

After checking the logs and cross-referencing them with the system journal, the only rough hint I get is:

Jul 11 01:36:26 openqaworker5 kernel: kvm [38069]: vcpu0, guest rIP: 0xffffffff924776b8 disabled perfctr wrmsr: 0xc2 data 0xffff

which corresponds more or less to the last time the test ran one of those commands. I see no coredumps whatsoever, but that message is a bit puzzling (similar messages repeat every now and then on the worker, yet other jobs apparently don't have the problem).

PS: Moved to the openQA project for now, although I'm torn between Infrastructure and this project itself.

Actions #3

Updated by MDoucha almost 3 years ago

  • Project changed from openQA Project to openQA Tests
  • Category deleted (Regressions/Crashes)

My first guess would be that the test somehow gets into a kernel panic. Add the ignore_level kernel command line parameter to grub.cfg during incident installation to see kernel backtraces in serial console logs. Here's an example of how LTP adds the ignore_level kernel parameter:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/kernel/install_ltp.pm#L353

Actions #4

Updated by MDoucha almost 3 years ago

  • Project changed from openQA Tests to openQA Project
  • Category set to Regressions/Crashes

Oops, sorry for overwriting some metadata.

Actions #5

Updated by okurz almost 3 years ago

  • Category changed from Regressions/Crashes to Support
  • Status changed from New to Feedback
  • Assignee set to okurz
  • Target version set to Ready

Hm, if there were a kernel panic, then the serial log should show at least something. But the system acts as if there had been a forced power reset. https://openqa.suse.de/tests/6410169/logfile?filename=serial0.txt shows "Wf8kQ-0-" as the last token before the next command is executed, but there is nothing after that token in the serial log.

I also manually checked the video from https://openqa.suse.de/tests/6410169/file/video.ogv, stepped through the frames one by one, and found nothing between the healthy bash session (as in https://openqa.suse.de/tests/6410169#step/filesystem/28) and the GRUB menu on boot (as in https://openqa.suse.de/tests/6410169#step/filesystem/29).

According to https://bugs.centos.org/view.php?id=6730 and https://bugzilla.redhat.com/show_bug.cgi?id=507085 messages like "kvm: vcpu0, guest rIP disabled perfctr wrmsr" are considered harmless. I doubt they are related to the problems we see.

@acarvajal I suggest you try the kernel command line parameters that mdoucha suggested. Also you could try if QEMUCPU=host makes any difference, just a hunch :)

EDIT: If you would like others to pick this up, I suggest you come up with "steps to reproduce", e.g. an openqa-cli api -X post isos command line to trigger a safe set of jobs that does not interfere with production, for cross-checking. Then we could potentially also ask someone else from the tools team to take over.
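
As an illustration of such a trigger, a sketch could look like the command below; DISTRI, VERSION, FLAVOR, ARCH and the BUILD marker are placeholders that would need to be replaced with the values of the affected HA scenario, and extra settings such as QEMUCPU=host are passed down to the scheduled jobs:

  # hypothetical example; adjust the placeholder values to the scenario under investigation
  openqa-cli api --host https://openqa.suse.de -X POST isos \
      DISTRI=sle VERSION=<version> FLAVOR=<flavor> ARCH=x86_64 \
      BUILD=<build>-poo95458 QEMUCPU=host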

Actions #6

Updated by okurz almost 3 years ago

  • Due date set to 2021-07-30
Actions #7

Updated by acarvajal almost 3 years ago

okurz wrote:

@acarvajal I suggest you try the kernel command line parameters that mdoucha suggested. Also you could try if QEMUCPU=host makes any difference, just a hunch :)

I have added QEMUCPU=host in the tests that run in the HANA validation openQA instance. Last time there were 3 failures with this issue out of 200+ jobs, so I guess it is a good place to try. I should see some results in the next runs on Monday.

BTW, interesting hunch. I am not seeing this issue on Power9 (100+ jobs in that same openQA instance), which made me think that whatever's causing it could be related to QEMU. I'll come back with some results next Monday.

I will also begin planning to introduce mdoucha's suggestions to gather more logs for the tests in osd.

Actions #8

Updated by okurz almost 3 years ago

acarvajal wrote:

okurz wrote:

@acarvajal I suggest you try the kernel command line parameters that mdoucha suggested. Also you could try if QEMUCPU=host makes any difference, just a hunch :)

I have added QEMUCPU=host in the tests that run in the HANA validation openQA instance. Last time there were 3 failures with this issue out of 200+ jobs, so I guess it is a good place to try. I should see some results in the next runs on Monday.

What I had in mind was to not change the production tests, but rather to trigger an additional, dedicated test set, e.g. following https://progress.opensuse.org/projects/openqatests/wiki/Wiki#Statistical-investigation , and after the weekend look at the corresponding test overview page to quickly get an overview of how many tests passed/failed.
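
The statistical-investigation approach referenced above essentially boils down to cloning an affected job many times outside of the production job groups and then comparing pass/fail counts. A rough sketch, where the job id and repetition count are arbitrary examples and _GROUP=0 detaches the clones from any job group:

  # clone one affected job 20 times with a marker build, outside of any job group
  for i in $(seq 1 20); do
      openqa-clone-job --within-instance https://openqa.suse.de 6410169 \
          _GROUP=0 BUILD=poo95458_investigation
  done

Whether the whole parallel cluster gets cloned along with it is worth double-checking for these multi-machine scenarios.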

Actions #9

Updated by acarvajal almost 3 years ago

I had to restart 2 tests with the failure yesterday in the HANA validation openQA instance.

So the difference between using QEMUCPU=host and not using it was 3 out of 200+ jobs last week versus 2 out of 200+ this week. I don't think this is statistically significant, and the bad news is that the issue is still present.

I would look into implementing mdoucha's suggestions and triggering some additional jobs with and without QEMUCPU=host to do a more thorough analysis.

Actions #10

Updated by okurz almost 3 years ago

  • Subject changed from [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios to [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha/filesystem.pm:81.*command.*mkfs -t ocfs2.*timed out":retry

I have seen the symptom

[2021-07-11T01:38:13.388 CEST] [info] ::: basetest::runtest: # Test died: command 'mkfs -t ocfs2 -F -N 16 "/dev/vg_cluster_md/lv_openqa"' timed out at /usr/lib/os-autoinst/testapi.pm line 959.

likely more than once – or I keep getting back to the same job ;) – still, I am trying with auto-review to automatically detect such cases and retrigger the corresponding tests. The regex should be specific enough to only catch HA tests, but it can be generalized or extended at any time to cover alternative symptoms. Quoting @acarvajal: "hope you can find an easy way to detect those cases .... it's not easy as it can happen anytime and in any module during the test", but we should start somewhere :)

Actions #11

Updated by acarvajal almost 3 years ago

okurz wrote:

I have seen the symptom

[2021-07-11T01:38:13.388 CEST] [info] ::: basetest::runtest: # Test died: command 'mkfs -t ocfs2 -F -N 16 "/dev/vg_cluster_md/lv_openqa"' timed out at /usr/lib/os-autoinst/testapi.pm line 959.

likely more than once – or I keep getting back to the same job ;) – still, I am trying with auto-review to automatically detect such cases and retrigger the corresponding tests. The regex should be specific enough to only catch HA tests, but it can be generalized or extended at any time to cover alternative symptoms. Quoting @acarvajal: "hope you can find an easy way to detect those cases .... it's not easy as it can happen anytime and in any module during the test", but we should start somewhere :)

EDIT: Tested with echo https://openqa.suse.de/tests/6410236 | env host=openqa.suse.de openqa-investigate with openqa-investigate from github.com/os-autoinst/scripts/ . But openqa-investigate currently does not support cloning jobs that are part of a multi-machine cluster, see https://github.com/os-autoinst/scripts/commit/371467dafcefb9182530c790c33632f8cfa9a297#diff-f73cf39a07f6cf8cdb453862496919d06df16d07e58b274e68ea148dd1f7dae5

That's one of the symptoms.

I'd say that whenever the SUT is unexpectedly rebooted, tests will fail in one of two ways depending on what the test module was doing:

  1. A timeout in an assert_script_run (or similar), such as the symptom in this filesystem test module failure.
  2. A failure in assert_screen such as in https://openqa.suse.de/tests/6426340#step/check_after_reboot/5

Since the majority of the HA test modules rely on either the root console or the serial terminal, I think the first case will be more common, but I don't know whether having a general rule to restart tests when commands time out is safe.

Actions #12

Updated by okurz almost 3 years ago

  • Subject changed from [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha/filesystem.pm:81.*command.*mkfs -t ocfs2.*timed out":retry to [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*command.*mkfs -t ocfs2.*timed out.*match=root-console timed out":retry
Actions #13

Updated by okurz almost 3 years ago

Tested the auto-review regex with

$ echo https://openqa.suse.de/tests/6410225 | env dry_run=1 host=openqa.suse.de ./openqa-label-known-issues
openqa-cli api --host https://openqa.suse.de -X POST jobs/6410225/comments text=poo#95458 [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*command.*mkfs -t ocfs2.*timed out.*match=root-console timed out":retry
openqa-cli api --host https://openqa.suse.de -X POST jobs/6410225/restart

so a comment would have been written and the test should have been restarted, assuming this works this way over the API for multi-machine clusters.

Actions #14

Updated by okurz almost 3 years ago

  • Subject changed from [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*command.*mkfs -t ocfs2.*timed out.*match=root-console timed out":retry to [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry
Actions #15

Updated by okurz almost 3 years ago

  • Related to action #95788: [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network added
Actions #16

Updated by okurz almost 3 years ago

  • Description updated (diff)
Actions #17

Updated by okurz over 2 years ago

  • Project changed from openQA Project to openQA Tests
  • Subject changed from [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry to [ha] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry
  • Category changed from Support to Bugs in existing tests
  • Status changed from Feedback to Workable
  • Assignee changed from okurz to acarvajal

The auto-review regex matching might be a bit too broad, as it also catches issues like https://openqa.suse.de/tests/6581119#step/iscsi_client/18 where the test fails in iscsi and then the post_fail_hook also fails to select a free root terminal. However, this is all within the scope of tests/ha, so I will leave this to you again. As a follow-up to #95458#note-2: sorry, I don't see how this is a problem with openQA itself.

Actions #18

Updated by MDoucha over 2 years ago

MDoucha wrote:

My first guess would be that the test somehow gets into a kernel panic. Add the ignore_level kernel command line parameter to grub.cfg during incident installation to see kernel backtraces in serial console logs. Here's an example of how LTP adds the ignore_level kernel parameter:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/kernel/install_ltp.pm#L353

Correction of my suggestion: the kernel command line parameter is actually ignore_loglevel. I'm also updating the link above to a permalink:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/49342da7528b6bc0a8b418090487bc40c7f8e4ce/tests/kernel/install_ltp.pm#L359
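
For completeness, adding that parameter by hand on a SUSE system could look roughly like the sketch below; GRUB_CMDLINE_LINUX_DEFAULT is assumed to be the relevant variable on SLE/openSUSE, and the LTP code linked above achieves a similar effect from within the test:

  # prepend ignore_loglevel to the default kernel command line and regenerate grub.cfg
  sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&ignore_loglevel /' /etc/default/grub
  grub2-mkconfig -o /boot/grub2/grub.cfg
  reboot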

Actions #19

Updated by okurz over 2 years ago

  • Due date changed from 2021-07-30 to 2021-08-06
  • Target version changed from Ready to future

@acarvajal are you ok to continue here?

Actions #20

Updated by okurz over 2 years ago

  • Related to action #94171: [qem][sap] test fails in check_logs about 50% of times added
Actions #21

Updated by acarvajal over 2 years ago

  • Project changed from openQA Tests to openQA Project
  • Due date changed from 2021-08-06 to 2021-09-10
  • Category deleted (Bugs in existing tests)
  • Status changed from Workable to Feedback
  • Assignee changed from acarvajal to okurz
  • Target version changed from future to Ready

okurz wrote:

@acarvajal are you ok to continue here?

Yes. I think so. I will probably sync with you before doing so though.

Actions #22

Updated by acarvajal over 2 years ago

  • Project changed from openQA Project to openQA Tests
  • Category set to Bugs in existing tests
  • Status changed from Feedback to Workable
  • Assignee changed from okurz to acarvajal
  • Target version changed from Ready to future
Actions #23

Updated by szarate over 2 years ago

  • Related to action #97013: [qe-core][qe-yast] test fails in handle_reboot, patch_and_reboot, installation added
Actions #24

Updated by openqa_review over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: migration_offline_scc_verify_sle15sp1_ltss_ha_alpha_node02
https://openqa.suse.de/tests/6958789

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The label in the openQA scenario is removed
Actions #25

Updated by okurz over 2 years ago

  • Subject changed from [ha] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry to [qe-sap][ha] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry

Using keyword "qe-sap" as verified by jmichel in weekly QE sync 2021-09-15

Actions #26

Updated by okurz over 2 years ago

This ticket is exceeding its due date. It popped up during the weekly QE sync 2021-09-22. We would appreciate a reaction within the next few days, at least updating the due date according to what we can realistically expect. See https://progress.opensuse.org/projects/openqatests/wiki#SLOs-service-level-objectives for details

Actions #27

Updated by acarvajal over 2 years ago

  • Due date changed from 2021-09-10 to 2021-10-15
Actions #28

Updated by openqa_review over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_qdevice_node2
https://openqa.suse.de/tests/7350101

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234
Actions #29

Updated by openqa_review over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_qdevice_node2
https://openqa.suse.de/tests/7393110

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234
Actions #30

Updated by okurz over 2 years ago

So we discussed this in a meeting. Thank you for inviting me :)

As this is a ticket about the tests themselves, I suggest that you within qe-sap drive the work. From the tools team we will be able to provide support and stand ready for collaboration. If you don't plan to have the issue fixed soon, the affected tests can also be unscheduled according to the QAM processes until (hopefully) the tests can eventually be brought back. One additional observation from my side: some months ago we still had highly stable multi-machine tests for the areas "network", "HPC" and "SAP", but it seems all three areas have not received a lot of love. There are multiple other areas where multi-machine tests are running just fine, so I am not aware of any generic openQA problems. What I see is limited to the aforementioned areas, hence it's unlikely that the issues will go away until explicitly addressed, because they are domain-specific. Based on my former experience, those issues could very well point to valid product regressions.
And https://openqa.suse.de/tests?match=wicked_ shows that openQA multi-machine tests relying on the network can work very reliably. No network related problem showing up there. A quick SQL query revealed that the fail ratio of "wicked" tests within the last 10 days is 0.6%, so very low. This shows that openQA multi-machine tests can be very stable, and also that we don't have a generic problem in our tooling or infrastructure. Regardless of whether these are sporadic product issues or test design flaws, at this point I recommend focusing on mitigating the negative impact on the openQA review procedures in the short term and removing the affected tests from the schedule. Also see https://confluence.suse.com/display/openqa/QAM+openQA+review+guide as a reference. During the time the tests are not within the validation schedule, QE-SAP would need to ensure by other means that the product quality is sufficient.
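
Such a query could look roughly like the following (a sketch only, assuming direct psql access to the openQA database and its jobs table; not necessarily the exact query used):

  psql openqa -c "
      SELECT result, count(*)
        FROM jobs
       WHERE test LIKE 'wicked_%'
         AND t_finished > now() - interval '10 days'
       GROUP BY result;"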

@acarvajal does it make sense for you to stay assigned to this ticket?

@rbranco as a direct follow-up to what we just discussed, have you seen #95458#note-29? I suggest removing ha_qdevice_node2 and all related test suites from the validation job groups until the problem can be fixed. Can you create a corresponding merge request for the test schedule in https://gitlab.suse.de/qa-maintenance/qam-openqa-yml ?

Actions #31

Updated by acarvajal over 2 years ago

  • Assignee changed from acarvajal to rbranco

okurz wrote:

And https://openqa.suse.de/tests?match=wicked_ shows that openQA multi-machine tests relying on the network can work very reliably. No network related problem showing up there.

This issue is definitely not network-related. 100% agreement with you there. The ticket that was tracking network issues in HA QAM tests is: https://progress.opensuse.org/issues/95788

This ticket instead tracks the sporadic failure where the SUT VM reboots when the test code is not expecting it, and even if this is more frequent in multi-machine (MM) scenarios, it can also happen in single-machine (SM) scenarios.

And this is not limited to osd. On openqa.wdf.sap.corp I've noticed the same issue in:

  • 9 jobs out of 384 jobs total for this week.
  • 4 jobs out of 384 jobs total for the week starting on 18.10.2021.
  • 4 jobs out of 384 jobs total for the week starting on 11.10.2021.
  • 12 jobs out of 384 jobs total for the week starting on 4.10.2021.

All these jobs are running with QEMUCPU=host since it was suggested in https://progress.opensuse.org/issues/95458#note-5, but even without that setting the failure rate was more or less the same. It is a low failure rate though, at around 1.8%.

Restarting all these tests results in them passing. Hence there is no other assumption to make except that a transient race condition is to blame for these failures, and not something related to the tests themselves.

A quick SQL query revealed that the fail ratio of "wicked" tests within the last 10 days is 0.6%, so very low.

What would be the fail ratio of "qdevice" and "qnetd" in the same period? Can you point me to where/how I can get that data myself?

@acarvajal does it make sense for you to stay assigned to this ticket?

I do not think so. I asked @jmichel who should I assign it to, so I am assigning this to @rbranco.

@rbranco as a direct follow-up to what we just discussed, have you seen #95458#note-29? I suggest removing ha_qdevice_node2 and all related test suites from the validation job groups until the problem can be fixed. Can you create a corresponding merge request for the test schedule in https://gitlab.suse.de/qa-maintenance/qam-openqa-yml ?

Following up on the "wicked" exercise from above, if we go to: https://openqa.suse.de/tests?match=ha_qdevice

We'll see that out of the last 500 qdevice jobs finished in osd, there are at the time of this writing:

343 passed, 109 softfailed, 12 failed, 6 skipped, 3 incomplete and 27 parallel failed

All 3 incompletes are failures while downloading the qcow2 image. Checking the Next & Previous tab of all 3 of these tests shows that they have passed earlier or later, so the only explanation I have for the missing qcow2 image failure is that the qdevice test started after osd had already cleaned the asset from storage.

The 12 failures amount to a 2.4% failure rate. This percentage increases to 7.8% if the skipped tests are removed and the parallel-failed ones are added.
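
As a quick sanity check of those percentages, using the counts quoted above:

  # 12 failed out of 500 finished jobs, then failed plus parallel-failed over the same 500
  python3 -c 'print(round(12/500*100, 1), round((12+27)/500*100, 1))'   # 2.4 7.8
  # strictly excluding the 6 skipped jobs from the denominator gives roughly 7.9 instead
  python3 -c 'print(round((12+27)/(500-6)*100, 1))'                     # 7.9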

However, if qdevice is a 2-node test, why are there so many more parallel-failed jobs than failed jobs? The explanation lies in the failed qnetd jobs that run in parallel to the qdevice nodes. Checking https://openqa.suse.de/tests?match=ha_qnetd shows 6 failures within the last 500 jobs, so that accounts for an extra 12 parallel-failed jobs.

But then when we get to why these tests failed, we see that:

So, out of 1000 qdevice/qnetd jobs:

  • 18 failures.
  • 8 of those due to screen rendering --> Not related to the tests themselves.
  • 4 possibly due to a product bug
  • 1 due to worker performance --> Not related to the test itself.
  • 2 due to support server network connectivity issues --> Not related to the tests themselves.
  • 1 due to failed connection to IBS --> Could be due to a test setting, QAM bot test scheduling, or IBS repository not present.
  • 2 due to HA dependencies not starting --> Could be product issue. Could be worker performance.

Not a single failure related to this ticket. And more than that, not enough grounds for removing these tests from the schedule.

No idea why the focus on the qdevice/qnetd tests when this issue was opened for several different scenarios.

I also insist that decreasing test coverage should not be the approach, but I will refrain from beating this dead horse.

Actions #32

Updated by okurz over 2 years ago

acarvajal wrote:

[…]

I appreciate your thorough analysis, and all of it up to this point is evaluated completely correctly. But still we should react to the problem at hand: QE-SAP and/or HA tests sometimes "randomly fail", e.g. due to the reported "spontaneous reboot issues". The consequence is that qa-maintenance/bot-ng will not auto-approve the corresponding SLE maintenance updates, and the group of openQA maintenance test reviewers that should merely "coordinate" are asked for help to move the affected SLE maintenance updates forward. 1. These test reviewers should not even need to do that job because the team qe-sap should do it (see https://confluence.suse.com/display/qasle/openQA+QE+Maintenance+Review ) and 2. The openQA maintenance test reviewers do not have the capacity to fix all the different test instabilities themselves. So, based on the individual issues you identified, let me provide some specific questions or suggestions:

So, out of 1000 qdevice/qnetd jobs:

  • 18 failures.
  • 8 of those due to screen rendering --> Not related to the tests themselves.

Even if that is the case, is there a ticket to improve that situation?

  • 4 possibly due to a product bug

So where is the follow-up for that? Just the job that was mentioned by openqa-review, https://openqa.suse.de/tests/7393110#step/check_after_reboot/15, shows that a service fails to start up. If all cases were like this one, then we should be good. But this only works if everybody sees the test failures the same way.

  • 1 due to worker performance --> Not related to the test itself.

OK, but which ticket covers this then? Commonly people say "worker performance issue" when tests apply too strict timeouts which need to be changed at the test level. If you find issues where you really think the performance is so bad that test changes won't help, then please let us know about the specific cases so that we can look into them.

  • 2 due to support server network connectivity issues --> Not related to the tests themselves.

but which ticket then?

  • 1 due to failed connection to IBS --> Could be due to a test setting, QAM bot test scheduling, or IBS repository not present.

In that case the tests can still be improved with better retrying.

[…]

Not a single failure related to this ticket. And more than that, not enough grounds for removing these tests from the schedule.

No idea why the focus on the qdevice/qnetd tests when this issue was opened for several different scenarios.

I came back to this ticket for two reasons: first, because there is https://progress.opensuse.org/issues/95458#note-29 pointing to a test failing due to the "test issue" described in this ticket (even though this might not be true, the test is labeled like that); and second, because the due date was exceeded by more than ten days, violating https://progress.opensuse.org/projects/openqatests/wiki#SLOs-service-level-objectives , hence I noticed it.

I also insist that decreasing test coverage should not be the approach, but I will refrain from beating this dead horse.

Well, removing the test (temporarily) until fixed should only be seen as a last resort. And to be honest: I repeatedly mention that option as the way I would go in order to spawn some motivation to fix it ;)

Actions #33

Updated by acarvajal over 2 years ago

okurz wrote:

  1. These test reviewers should not even need to do that job because the team qe-sap should do it (see https://confluence.suse.com/display/qasle/openQA+QE+Maintenance+Review ) and 2. The openQA maintenance test reviewers do not have the capacity to fix all the different test instabilities themselves.

No disagreements there. What I disagree with - and what I have repeated more times than I care to count - is the idea that the solution should be dropping coverage; it should never be. Even if, as you say, "randomly failing" tests impact automatic reviews so much, dropping coverage is a practice that leads to a false sense of security. Even when the practice is followed, it can lead to unacceptable drops in coverage (for example, see https://gitlab.suse.de/qa-maintenance/qam-openqa-yml/-/merge_requests/150 & https://progress.opensuse.org/issues/68932).

So, based on the individual issues you identified, let me provide some specific questions or suggestions:

OK. Feels like you're dodging my argument which was "why remove from the schedule a test that has a success rate of over 90%?", but let's go ahead.

So, out of 1000 qdevice/qnetd jobs:

  • 18 failures.
  • 8 of those due to screen rendering --> Not related to the tests themselves.

Even if that is the case, is there a ticket to improve that situation?

No idea. Is there?

  • 4 possibly due to a product bug

So where is the follow-up for that? Just the job that was mentioned by openqa-review, https://openqa.suse.de/tests/7393110#step/check_after_reboot/15, shows that a service fails to start up. If all cases were like this one, then we should be good. But this only works if everybody sees the test failures the same way.

Huh? I did mention https://openqa.suse.de/tests/7393110#step/check_after_reboot/15 below.

Is there a follow-up to the 4 tests failing due to a product bug? I cannot say. The first time I saw these failures was yesterday, and from what I could see, later tests on those scenarios passed.

Should I open a bug for an issue that's already fixed?

  • 1 due to worker performance --> Not related to the test itself.

OK, but which ticket covers this then? Commonly people say "worker performance issue" when tests apply too strict timeouts which need to be changed at the test level. If you find issues where you really think the performance is so bad that test changes won't help, then please let us know about the specific cases so that we can look into them.

"Test apply too strict timeouts"???? Give me a break!

If the response to any issue presented is always going to be "perhaps you have a wrong setting", "perhaps there is a problem in the test code" or "perhaps you were too aggressive with a timeout", then we will continue to ignore potential issues.

In this particular case:

The command waited for 2 whole minutes before failing. Is 2 minutes too aggressive? I do not think so.

I grant that I may have misspoken when claiming "slow worker" though. On that command, the issue could also be network-related. Sadly there is no way to tell from the failure, and of course the same test in the same scenario passed in a later run (several, in fact): https://openqa.suse.de/tests/7549015

  • 2 due to support server network connectivity issues --> Not related to the tests themselves.

but which ticket then?

https://progress.opensuse.org/issues/95788

  • 1 due to failed connection to IBS --> Could be due to a test setting, QAM bot test scheduling, or IBS repository not present.

In that case the tests can still be improved with better retrying.

Agree.

Well, removing the test (temporarily) until fixed should only be seen as a last resort. And to be honest: I repeatedly mention that option as the way I would go in order to spawn some motivation to fix it ;)

Understood. I hope you succeed and manage to mobilize all the teams required to get to a solution. From my experience, even if QE-SAP are the experts on these scenarios, these random failures usually fall outside of QE-SAP's field of expertise, so I do agree and believe that a coordinated effort is required.

Actions #34

Updated by rbranco over 2 years ago

  • Due date changed from 2021-10-15 to 2022-01-31
Actions #35

Updated by openqa_review over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_beta_node02
https://openqa.suse.de/tests/7671905

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234
Actions #36

Updated by openqa_review over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_hawk_haproxy_node01
https://openqa.suse.de/tests/7749089

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234
Actions #37

Updated by openqa_review over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_ctdb_node01
https://openqa.suse.de/tests/7829513

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234
Actions #38

Updated by openqa_review over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_ctdb_node02
https://openqa.suse.de/tests/7871807

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234
Actions #39

Updated by openqa_review over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_diskless_sbd_qdevice_node1
https://openqa.suse.de/tests/7976342

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234
Actions #40

Updated by jkohoutek over 2 years ago

Today I was looking at this again while solving issues with this: https://openqa.suse.de/tests/8012986#step/check_logs/2

From my observation it looks like ALL jobs whose running time reaches 2h fail, while the faster ones around 1h succeed. In between it's random, but they also usually succeed: https://openqa.suse.de/tests/8005979#next_previous

The question is why the same update once took almost 2 hours and failed:

 check_logs   :22413:saptune  3 days ago ( 01:55 hours ) 

but a day later it took just 1 hour and succeeded:

  :22413:saptune  2 days ago ( 01:10 hours ) 
Actions #41

Updated by rbranco about 2 years ago

  • Due date changed from 2022-01-31 to 2022-04-30
Actions #42

Updated by openqa_review about 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_rolling_upgrade_migration_node01
https://openqa.suse.de/tests/8192134

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234
Actions #44

Updated by okurz about 2 years ago

I can confirm. So far this looks good. openqa-query-for-job-label poo#95458 shows

2201470|2022-02-22 04:16:29|done|failed|container-host-microos||openqaworker7
8318522|2022-03-13 03:51:09|done|failed|qam_ha_rolling_update_node01||openqaworker9
8306778|2022-03-10 14:14:56|done|failed|ha_beta_node02||malbec
8296656|2022-03-09 17:30:46|done|failed|ha_beta_node02||QA-Power8-5-kvm
8297435|2022-03-09 12:45:09|done|failed|migration_offline_dvd_verify_sle15sp1_ltss_ha_alpha_node01||openqaworker3
8297914|2022-03-09 10:33:55|done|failed|ha_hawk_haproxy_node02||openqaworker6
8297846|2022-03-09 10:07:52|done|failed|ha_alpha_node01||QA-Power8-5-kvm
8292351|2022-03-09 05:02:30|done|failed|ha_qdevice_node2||QA-Power8-5-kvm
8293695|2022-03-09 04:59:57|done|failed|ha_qdevice_node2||openqaworker-arm-1
8293662|2022-03-09 04:38:34|done|failed|ha_ctdb_node02||openqaworker-arm-2
8293690|2022-03-09 03:52:44|done|failed|ha_priority_fencing_node01||openqaworker-arm-3

so the latest failures were on 2022-03-09

Actions #45

Updated by openqa_review about 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_priority_fencing_node01
https://openqa.suse.de/tests/8439544#step/iscsi_client/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #46

Updated by openqa_review almost 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_delta_node02
https://openqa.suse.de/tests/8741156#step/iscsi_client/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 60 days if nothing changes in this ticket.

Actions #47

Updated by bschmidt over 1 year ago

  • Status changed from Resolved to In Progress

Unfortunately, this is happening again :-(
see https://openqa.suse.de/tests/9679509#step/check_after_reboot/30

Actions #48

Updated by slo-gin over 1 year ago

This ticket is 10 days after the due-date. Please consider closing this ticket or move the due-date accordingly.

Actions #49

Updated by rbranco over 1 year ago

I vote for closing this ticket as the issue has nothing to do with SAP/HA.

Actions #50

Updated by slo-gin over 1 year ago

This ticket is 10 days after the due-date. Please consider closing this ticket or move the due-date accordingly.

Actions #51

Updated by acarvajal over 1 year ago

rbranco wrote:

I vote for closing this ticket as the issue has nothing to do with SAP/HA.

Is the issue gone? Judging by https://progress.opensuse.org/issues/95458#note-47 it isn't.

Before closing I would vote for re-assignment.

Actions #52

Updated by slo-gin over 1 year ago

This ticket is 10 days after the due-date. Please consider closing this ticket or move the due-date accordingly.

Actions #53

Updated by rbranco over 1 year ago

acarvajal wrote:

rbranco wrote:

I vote for closing this ticket as the issue has nothing to do with SAP/HA.

Is the issue gone? Judging by https://progress.opensuse.org/issues/95458#note-47 it isn't.

Before closing I would vote for re-assignment.

This poo is too generic IMHO. Can you please reassign it to someone? I will be in squad rotation from November until February.

Actions #54

Updated by slo-gin over 1 year ago

This ticket is 10 days after the due-date. Please consider closing this ticket or move the due-date accordingly.

Actions #55

Updated by okurz over 1 year ago

  • Due date deleted (2022-04-30)

This ticket had a due date set but already exceeded it by more than 14 days. We would like to take the due date seriously, so please update the ticket accordingly (resolve the ticket, update the due date, or remove the due date). See https://progress.opensuse.org/projects/openqatests/wiki/Wiki#SLOs-service-level-objectives for details.

Actions #56

Updated by openqa_review over 1 year ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: migration_offline_dvd_verify_sle12sp5_ha_alpha_node01_atmg
https://openqa.suse.de/tests/9918083#step/migrate_clvmd_to_lvmlockd/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 44 days if nothing changes in this ticket.

Actions #57

Updated by rbranco over 1 year ago

  • Assignee changed from rbranco to fgerling
Actions #58

Updated by openqa_review over 1 year ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: migration_offline_dvd_verify_sle15sp2_ltss_ha_alpha_node02
https://openqa.suse.de/tests/10027954#step/check_after_reboot/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #59

Updated by vsvecova over 1 year ago

Hello @fgerling, I'm wondering whether there is any update on this issue? It has been blocking the auto-approval of quite a few maintenance updates lately.

Actions #60

Updated by fgerling over 1 year ago

  • Status changed from In Progress to Feedback

The last comments from Alvaro and Ricardo mention that this is not an SAP-specific issue. Is there a process to hand it over?
I requested feedback from the PO regarding priority and will update here when I get an answer.

Actions #61

Updated by vsvecova over 1 year ago

I'm not sure how you define specific, but I don't recall seeing this issue anywhere other than in HA-related jobs. I'm not aware of any ticket hand-over process; I guess the rule of thumb has always been that the squad whose tests are failing is also responsible for the fix. In any case, I'm pondering the usefulness of a test that fails so often. Wouldn't it make more sense to just unschedule them?

Actions #62

Updated by LMartin over 1 year ago

  • Assignee changed from fgerling to bschmidt

This ticket describes SUT reboots which are unexpected and sporadic, e.g. #note-31 and #note-32 from a year ago. Indeed, it was so sporadic back then that it was hard to find a reproducer.
However, looking at the recent failures from 15 SP5 in this ticket (https://openqa.suse.de/tests/10027954#next_previous and https://openqa.suse.de/tests/10027954#next_previous), those new failures are very frequent. So those two cases are either the reproducer that has been asked for in this ticket, or an actual product bug which needs attention.

Miura and Birger: can you please check and give feedback here on whether https://openqa.suse.de/tests/10027954#next_previous and https://openqa.suse.de/tests/10027954#next_previous are real issues or the sporadic unexpected reboots described here?

Regarding auto-approvals of maintenance updates, I have asked Ednilson Miura and Birger Schmidt from QE-SAP to keep an extra eye on http://dashboard.qam.suse.de/blocked to make sure SAP/HA tests are not unnecessarily blocking updates due to these sporadic SUT reboots, e.g. those tests should (as a workaround) be retriggered & assessed to see whether they are real issues or these sporadic reboots. If you see QE-SAP blocking maintenance updates, please feel free to reach out and ask for the time being.

For a longer-term solution I need to check with Alvaro when he returns from vacation. I need to understand whether there is a commonality among these failures, e.g. whether it is always migration or some specific HA test(s), or only migration, etc.
In 2023 we can for sure migrate the tests to new workers, but based on what I read above, the resolution is probably not that simple.

And no, unscheduling tests does not make sense in my view. Fixing broken tests makes absolute sense though.

Actions #63

Updated by openqa_review over 1 year ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: migration_offline_dvd_verify_sle15sp2_ltss_ha_alpha_node02
https://openqa.suse.de/tests/10218340#step/check_after_reboot/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #64

Updated by tinita about 1 year ago

I can see that grep sometimes times out with the regex in the title.
Since (?s) is used, every .* can span multiple lines, which involves a lot of backtracking and might not be needed. Please consider changing some of the .* to [^\n]*, or dropping the (?s) and changing the .* that has to span lines to [\S\s]*.
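
One possible rewrite along those lines (an illustration only, assuming the failure message sits on a single log line while only the gap up to the root-console needle spans lines; this would need verifying against real logs before changing the subject regex) could be checked against a downloaded autoinst-log.txt like so:

  # hypothetical tightened variant of the auto_review regex
  regex='tests/ha[^\n]*(command[^\n]*timed out|Test died)[\S\s]*match=root-console timed out'
  # slurp the whole log (-0777) and report whether the pattern matches
  perl -0777 -ne 'print "matches\n" if m{'"$regex"'}' autoinst-log.txt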

Actions #65

Updated by openqa_review about 1 year ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: migration_offline_scc_verify_sle12sp4_ltss_ha_alpha_node02
https://openqa.suse.de/tests/10562975#step/migrate_clvmd_to_lvmlockd/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 80 days if nothing changes in this ticket.

Actions #66

Updated by openqa_review 11 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: xfstests_xfs-xfs-reflink
https://openqa.suse.de/tests/11162974#step/run/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 160 days if nothing changes in this ticket.

Actions #67

Updated by acarvajal 8 months ago

  • Assignee changed from bschmidt to acarvajal
Actions #68

Updated by acarvajal 8 months ago

  • Category changed from Bugs in existing tests to Infrastructure
  • Status changed from Feedback to Closed

Closing this until the issue is seen again, so that more recent jobs can be referenced.

Actions #69

Updated by openqa_review 5 months ago

  • Status changed from Closed to Feedback

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_ctdb_node01
https://openqa.suse.de/tests/12857331#step/iscsi_client/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 40 days if nothing changes in this ticket.
