
action #95458

[qe-sap][ha] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry

Added by acarvajal 10 months ago. Updated 4 days ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Bugs in existing tests
Target version:
Start date:
2021-07-13
Due date:
2022-04-30
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

openQA tests in HA scenarios (2 or 3 node clusters) fail in different modules due to unexpected reboots
in one or more of the SUTs:

  1. QAM HA qdevice node 1, fails in ha_cluster_init module
  2. QAM HA rolling upgrade migration, node 2, fails in filesystem module
  3. QAM HA hawk/HAProxy node 1, fails in check_after_reboot module
  4. QAM 2 nodes, node 1, fails in ha_cluster_init module

Test suite description

The base test suite is used for job templates defined in YAML documents. It has no settings of its own.

Reproducible

The issue is very sporadic and not always reproducible. Usually, re-triggering the jobs leads to the tests passing.

For example, re-triggered counterparts of the jobs above succeeded:

  1. https://openqa.suse.de/tests/6435165
  2. https://openqa.suse.de/tests/6435380
  3. https://openqa.suse.de/tests/6435384
  4. https://openqa.suse.de/tests/6435389

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label ,
e.g. call: openqa-query-for-job-label poo#95458

Expected result

  1. Last good: :20349:samba (or more recent)
  2. Last good: :20121:crmsh
  3. Last good: 20321:kernel-ec2
  4. Last good: MR:244261:crmsh

Related issues

Related to openQA Tests - action #95788: [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network (Resolved, 2021-07-21)

Related to openQA Tests - action #94171: [qem][sap] test fails in check_logs about 50% of times (Rejected, 2021-06-17)

Related to openQA Tests - action #97013: [qe-core][qe-yast] test fails in handle_reboot, patch_and_reboot, installation (New, 2021-08-17)

History

#1 Updated by maritawerner 10 months ago

  • Subject changed from SUT reboots unexpectedly, leading to tests failing in HA scenarios to [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios

#2 Updated by szarate 10 months ago

  • Project changed from openQA Tests to openQA Project
  • Category changed from Bugs in existing tests to Concrete Bugs

Another example can be: https://openqa.suse.de/tests/6410169#step/filesystem/28

After checking the logs and cross-referencing with the system journal, the only rough hint I get is:

Jul 11 01:36:26 openqaworker5 kernel: kvm [38069]: vcpu0, guest rIP: 0xffffffff924776b8 disabled perfctr wrmsr: 0xc2 data 0xffff

which corresponds more or less to the last time the test ran one of those commands... I see no coredumps whatsoever... but that message is a bit puzzling (similar messages repeat every now and then on the worker, but other jobs apparently don't have the problem)

PS: Moved to the openQA project for now, although I'm torn between infrastructure and this project itself

#3 Updated by MDoucha 10 months ago

  • Project changed from openQA Project to openQA Tests
  • Category deleted (Concrete Bugs)

My first guess would be that the test somehow gets into a kernel panic. Add the ignore_level kernel command line parameter to grub.cfg during incident installation to see kernel backtraces in serial console logs. Here's an example of how LTP adds the ignore_level kernel parameter:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/kernel/install_ltp.pm#L353

#4 Updated by MDoucha 10 months ago

  • Project changed from openQA Tests to openQA Project
  • Category set to Concrete Bugs

Oops, sorry for overwriting some metadata.

#5 Updated by okurz 10 months ago

  • Category changed from Concrete Bugs to Support
  • Status changed from New to Feedback
  • Assignee set to okurz
  • Target version set to Ready

Hm, if there were a kernel panic, the serial log should show at least something. But the system acts as if during a forced power reset. https://openqa.suse.de/tests/6410169/logfile?filename=serial0.txt mentions "Wf8kQ-0-" as the last token before the next command is executed, but there is nothing after that token in the serial log.

I also manually checked the video from https://openqa.suse.de/tests/6410169/file/video.ogv, stepped through the frames one by one, and did not find anything between the healthy bash session as in https://openqa.suse.de/tests/6410169#step/filesystem/28 and the grub menu on boot as in https://openqa.suse.de/tests/6410169#step/filesystem/29

According to https://bugs.centos.org/view.php?id=6730 and https://bugzilla.redhat.com/show_bug.cgi?id=507085 messages like "kvm: vcpu0, guest rIP disabled perfctr wrmsr" are considered harmless. I doubt they are related to the problems we see.

acarvajal I suggest you try the kernel command line parameters that mdoucha suggested. Also you could try if QEMUCPU=host makes any difference, just a hunch :)

EDIT: If you would like others to pick this up, I suggest you come up with "steps to reproduce", e.g. an openqa-cli api -X post isos command line to trigger a safe set of jobs that do not interfere with production for cross-checking. Then we could potentially also ask someone else from the tools team to take over.

#6 Updated by okurz 10 months ago

  • Due date set to 2021-07-30

#7 Updated by acarvajal 10 months ago

okurz wrote:

acarvajal I suggest you try the kernel command line parameters that mdoucha suggested. Also you could try if QEMUCPU=host makes any difference, just a hunch :)

I have added QEMUCPU=host in the tests that run in the HANA validation openQA instance. Last time there were 3 failures with this issue out of 200+ jobs, so I guess it is a good place to try. I should see some results in the next runs on Monday.

BTW, interesting hunch. I am not seeing this issue on Power9 (100+ jobs in that same openQA instance), which made me think that whatever's causing it could be related to qemu. I'll come back with some results next Monday.

I will also begin planning to introduce mdoucha's suggestions to gather more logs for the tests in osd.

#8 Updated by okurz 10 months ago

acarvajal wrote:

okurz wrote:

acarvajal I suggest you try the kernel command line parameters that mdoucha suggested. Also you could try if QEMUCPU=host makes any difference, just a hunch :)

I have added QEMUCPU=host in the tests that run in the HANA validation openQA instance. Last time there were 3 failures with this issue out of 200+ jobs, so I guess it is a good place to try. I should see some results in the next runs on Monday.

What I thought of is to not change the production tests but rather trigger an additional, dedicated test set, e.g. following https://progress.opensuse.org/projects/openqatests/wiki/Wiki#Statistical-investigation and after the weekend look at the corresponding test overview page to quickly get the overview of how many tests passed/failed.

#9 Updated by acarvajal 10 months ago

I had to restart 2 tests with the failure yesterday in the HANA validation openQA instance.

So the difference between using QEMUCPU=host or not was 3 out of 200+ last week to 2 out of 200+ this week. I don't think this is statistically relevant, and the bad news is that the issue is still present.
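The claim that 3 failures out of 200+ versus 2 out of 200+ is not statistically relevant can be sanity-checked with a quick two-proportion z-test. A rough sketch in Python, using the approximate counts from this comment (the exact job totals of 200 per week are an assumption):

```python
import math

def two_proportion_z(f1, n1, f2, n2):
    """Normal-approximation z-statistic for the difference of two failure rates."""
    p1, p2 = f1 / n1, f2 / n2
    p = (f1 + f2) / (n1 + n2)          # pooled failure rate
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Without QEMUCPU=host: ~3 failures in ~200 jobs; with it: ~2 in ~200 jobs
z = two_proportion_z(3, 200, 2, 200)
print(f"z = {z:.2f}")  # |z| is far below 1.96, i.e. no significant difference
```

With z around 0.45, well inside the ±1.96 band for a 95% confidence level, the data indeed cannot distinguish the two settings.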

I would look into implementing mdoucha's suggestions and triggering some additional jobs with and without QEMUCPU=host to do a more thorough analysis.

#10 Updated by okurz 10 months ago

  • Subject changed from [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios to [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha/filesystem.pm:81.*command.*mkfs -t ocfs2.*timed out":retry

I have seen the symptom

[2021-07-11T01:38:13.388 CEST] [info] ::: basetest::runtest: # Test died: command 'mkfs -t ocfs2 -F -N 16 "/dev/vg_cluster_md/lv_openqa"' timed out at /usr/lib/os-autoinst/testapi.pm line 959.

likely more than once – or I keep getting back to the same job ;) – still, I am trying with auto-review to automatically detect such cases and retrigger the affected tests. The regex should be specific enough to only catch HA tests, but at any time it can be generalized or extended to cover alternative symptoms. Quoting acarvajal: "hope you can find an easy way to detect those cases .... it's not easy as it can happen anytime and in any module during the test" – but we should start somewhere :)
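The auto-review regex from the ticket subject can be exercised locally; a minimal sketch (the log excerpt below is synthetic, pieced together from the symptoms quoted in this ticket, not a real autoinst-log.txt):

```python
import re

# Regex from the ticket subject; the inline (?s) flag makes '.' match
# newlines too, so the pattern can span a whole log file
pattern = re.compile(
    r'(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out')

# Synthetic log excerpt modelled on the failures quoted in this ticket
log = """\
tests/ha/filesystem.pm:81 called testapi::assert_script_run
# Test died: command 'mkfs -t ocfs2 -F -N 16 "/dev/vg_cluster_md/lv_openqa"' timed out
backend::baseclass::check_asserted_screen: match=root-console timed out
"""

print(bool(pattern.search(log)))  # True: this job would be labeled and retried
```

A log that stays within tests/ha but contains no timeout would not match, which is what keeps the label-and-retry rule from firing on unrelated failures.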

#11 Updated by acarvajal 10 months ago

okurz wrote:

I have seen the symptom

[2021-07-11T01:38:13.388 CEST] [info] ::: basetest::runtest: # Test died: command 'mkfs -t ocfs2 -F -N 16 "/dev/vg_cluster_md/lv_openqa"' timed out at /usr/lib/os-autoinst/testapi.pm line 959.

likely more than once – or I keep getting back to the same job ;) – still, I am trying with auto-review to automatically detect such cases and retrigger the affected tests. The regex should be specific enough to only catch HA tests, but at any time it can be generalized or extended to cover alternative symptoms. Quoting acarvajal: "hope you can find an easy way to detect those cases .... it's not easy as it can happen anytime and in any module during the test" – but we should start somewhere :)

EDIT: Tested with echo https://openqa.suse.de/tests/6410236 | env host=openqa.suse.de openqa-investigate with openqa-investigate from github.com/os-autoinst/scripts/ . But openqa-investigate currently does not support cloning jobs that are part of a multi-machine cluster, see https://github.com/os-autoinst/scripts/commit/371467dafcefb9182530c790c33632f8cfa9a297#diff-f73cf39a07f6cf8cdb453862496919d06df16d07e58b274e68ea148dd1f7dae5

That's one of the symptoms.

I'd say whenever SUT is unexpectedly rebooted, tests will fail in one of two ways depending on what the test module was doing:

  1. A timeout in an assert_script_run (or similar), such as the symptom in this filesystem test module failure.
  2. A failure in assert_screen such as in https://openqa.suse.de/tests/6426340#step/check_after_reboot/5

Since the majority of the HA test modules rely on either the root console or the serial terminal, I think the first case will be more common, but I don't know if a general rule to restart tests when commands time out is safe.

#12 Updated by okurz 10 months ago

  • Subject changed from [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha/filesystem.pm:81.*command.*mkfs -t ocfs2.*timed out":retry to [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*command.*mkfs -t ocfs2.*timed out.*match=root-console timed out":retry

#13 Updated by okurz 10 months ago

Tested the auto-review regex with

$ echo https://openqa.suse.de/tests/6410225 | env dry_run=1 host=openqa.suse.de ./openqa-label-known-issues
openqa-cli api --host https://openqa.suse.de -X POST jobs/6410225/comments text=poo#95458 [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*command.*mkfs -t ocfs2.*timed out.*match=root-console timed out":retry
openqa-cli api --host https://openqa.suse.de -X POST jobs/6410225/restart

so a comment would have been written and the test should have been restarted, assuming this works this way over the API for multi-machine clusters.

#14 Updated by okurz 10 months ago

  • Subject changed from [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*command.*mkfs -t ocfs2.*timed out.*match=root-console timed out":retry to [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry

#15 Updated by okurz 10 months ago

  • Related to action #95788: [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network added

#16 Updated by okurz 10 months ago

  • Description updated (diff)

#17 Updated by okurz 10 months ago

  • Project changed from openQA Project to openQA Tests
  • Subject changed from [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry to [ha] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry
  • Category changed from Support to Bugs in existing tests
  • Status changed from Feedback to Workable
  • Assignee changed from okurz to acarvajal

The auto-review regex matching might be a bit too broad, as it also catches issues like https://openqa.suse.de/tests/6581119#step/iscsi_client/18 where the test fails in iscsi and then the post_fail_hook also fails to select a free root terminal. However, this is all within the scope of tests/ha, so I will leave this to you again. As a follow-up to #95458#note-2: sorry, I don't see how this is a problem with openQA itself.

#18 Updated by MDoucha 10 months ago

MDoucha wrote:

My first guess would be that the test somehow gets into a kernel panic. Add the ignore_level kernel command line parameter to grub.cfg during incident installation to see kernel backtraces in serial console logs. Here's an example of how LTP adds the ignore_level kernel parameter:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/kernel/install_ltp.pm#L353

Correction of my suggestion: the kernel command line parameter is actually ignore_loglevel. Also updating the link above to permalink:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/49342da7528b6bc0a8b418090487bc40c7f8e4ce/tests/kernel/install_ltp.pm#L359
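What such a change boils down to is appending ignore_loglevel to the kernel command line in the GRUB configuration and regenerating grub.cfg. A hedged, generic sketch of the text edit (this is not the actual install_ltp.pm code, which is Perl and linked above; the helper name is made up for illustration):

```python
import re

def add_kernel_param(grub_default: str, param: str = "ignore_loglevel") -> str:
    """Append a kernel parameter to GRUB_CMDLINE_LINUX in /etc/default/grub
    (after which grub2-mkconfig would regenerate grub.cfg)."""
    def append(match):
        args = match.group(2)
        if param in args.split():
            return match.group(0)        # already present, leave untouched
        sep = " " if args else ""
        return f'{match.group(1)}{args}{sep}{param}"'
    return re.sub(r'(GRUB_CMDLINE_LINUX=")([^"]*)"', append, grub_default)

print(add_kernel_param('GRUB_CMDLINE_LINUX="quiet splash"'))
# GRUB_CMDLINE_LINUX="quiet splash ignore_loglevel"
```

The function is idempotent, so re-running the test setup does not duplicate the parameter.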

#19 Updated by okurz 10 months ago

  • Due date changed from 2021-07-30 to 2021-08-06
  • Target version changed from Ready to future

acarvajal are you ok to continue here?

#20 Updated by okurz 10 months ago

  • Related to action #94171: [qem][sap] test fails in check_logs about 50% of times added

#21 Updated by acarvajal 9 months ago

  • Project changed from openQA Tests to openQA Project
  • Due date changed from 2021-08-06 to 2021-09-10
  • Category deleted (Bugs in existing tests)
  • Status changed from Workable to Feedback
  • Assignee changed from acarvajal to okurz
  • Target version changed from future to Ready

okurz wrote:

acarvajal are you ok to continue here?

Yes. I think so. I will probably sync with you before doing so though.

#22 Updated by acarvajal 9 months ago

  • Project changed from openQA Project to openQA Tests
  • Category set to Bugs in existing tests
  • Status changed from Feedback to Workable
  • Assignee changed from okurz to acarvajal
  • Target version changed from Ready to future

#23 Updated by szarate 9 months ago

  • Related to action #97013: [qe-core][qe-yast] test fails in handle_reboot, patch_and_reboot, installation added

#24 Updated by openqa_review 9 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: migration_offline_scc_verify_sle15sp1_ltss_ha_alpha_node02
https://openqa.suse.de/tests/6958789

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The label in the openQA scenario is removed

#25 Updated by okurz 8 months ago

  • Subject changed from [ha] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry to [qe-sap][ha] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry

Using keyword "qe-sap" as verified by jmichel in weekly QE sync 2021-09-15

#26 Updated by okurz 8 months ago

This ticket is exceeding its due date. It popped up during the weekly QE sync 2021-09-22. We would appreciate a reaction within the next days, at least updating the due date according to what we can realistically expect. See https://progress.opensuse.org/projects/openqatests/wiki#SLOs-service-level-objectives for details

#27 Updated by acarvajal 8 months ago

  • Due date changed from 2021-09-10 to 2021-10-15

#28 Updated by openqa_review 7 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_qdevice_node2
https://openqa.suse.de/tests/7350101

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

#29 Updated by openqa_review 7 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_qdevice_node2
https://openqa.suse.de/tests/7393110

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

#30 Updated by okurz 7 months ago

so we discussed in a meeting. Thank you for inviting me :)

As this is a ticket about the tests themselves, I suggest that you within qe-sap drive the work; from the tools team we will be able to provide support and stand ready for collaboration. If you don't plan to have the issue fixed soon, the corresponding tests can also be unscheduled according to the QAM processes until (hopefully) eventually the tests can be brought back. One additional observation from my side: some months ago we still had highly stable multi-machine tests for the areas "network", "HPC" and "SAP", but it seems all three areas have not received a lot of love. There are multiple other areas where multi-machine tests are running just fine, so I am not aware of any generic openQA problems. The issues I see are limited to the aforementioned areas, hence it's unlikely they will go away until explicitly addressed, because they are domain-specific. Based on my former experience, those issues could very well point to valid product regressions.
And https://openqa.suse.de/tests?match=wicked_ shows that openQA multi-machine tests relying on the network can work very reliably; no network-related problems show up there. A quick SQL query revealed that the fail ratio of "wicked" tests within the last 10 days is 0.6%, so very low. This shows that openQA multi-machine tests can be very stable, and also that we don't have a generic problem in our tooling or infrastructure. Regardless of whether these are sporadic product issues or test design flaws, at this point I recommend focusing on mitigating the negative impact on the openQA review procedures in the short term and removing the corresponding tests from the schedule. Also see https://confluence.suse.com/display/openqa/QAM+openQA+review+guide as reference. During the time the tests are not within the validation schedule, QE-SAP would need to ensure by other means that the product quality is sufficient.

acarvajal does it make sense for you to stay assigned to this ticket?

rbranco as direct followup to what we just discussed, have you seen #95458#note-29 ? I suggest to remove ha_qdevice_node2 and all related test suites from the validation job groups until the problem can be fixed. Can you create an according merge request for the test schedule in https://gitlab.suse.de/qa-maintenance/qam-openqa-yml ?

#31 Updated by acarvajal 7 months ago

  • Assignee changed from acarvajal to rbranco

okurz wrote:

And https://openqa.suse.de/tests?match=wicked_ shows that openQA multi-machine tests relying on the network can work very reliably. No network related problem showing up there.

This issue is definitely not network-related. 100% agreement with you there. The ticket that was tracking network issues in HA QAM tests is: https://progress.opensuse.org/issues/95788

This ticket instead tracks the sporadic failure where the SUT VM reboots when the test code is not expecting it, and even if this is more frequent in multi-machine (MM) scenarios, it can also happen in single-machine (SM) scenarios.

And this is not limited only to osd. In openqa.wdf.sap.corp I've noticed the same issue in:

  • 9 jobs out of 384 jobs total for this week.
  • 4 jobs out of 384 jobs total for the week starting on 18.10.2021.
  • 4 jobs out of 384 jobs total for the week starting on 11.10.2021.
  • 12 jobs out of 384 jobs total for the week starting on 4.10.2021

All these jobs are running with QEMUCPU=host since it was suggested in https://progress.opensuse.org/issues/95458#note-5, but even without that setting, failure rate was more or less the same. It is a low failure rate though, at around 1.8%.

Restarting all these tests results in them passing. Hence there is no other assumption to make except that a transient race condition is to blame for these failures, and not something related to the tests themselves.

A quick SQL query revealed that for the fail ratio of "wicked" tests within the last 10 days is 0.6% so very low.

What would be the fail ratio of "qdevice" and "qnetd" in the same period? Can you point me to where/how I can get that data myself?

acarvajal does it make sense for you to stay assigned to this ticket?

I do not think so. I asked @jmichel who should I assign it to, so I am assigning this to @rbranco.

rbranco as direct followup to what we just discussed, have you seen #95458#note-29 ? I suggest to remove ha_qdevice_node2 and all related test suites from the validation job groups until the problem can be fixed. Can you create an according merge request for the test schedule in https://gitlab.suse.de/qa-maintenance/qam-openqa-yml ?

Following up on the "wicked" exercise from above, if we go to: https://openqa.suse.de/tests?match=ha_qdevice

We'll see that out of the last 500 qdevice jobs finished in osd, there are at the time of this writing:

343 passed, 109 softfailed, 12 failed, 6 skipped, 3 incomplete and 27 parallel failed

All 3 incompletes are failures while downloading the qcow2 image. Checking the Next & Previous tab in all 3 of these tests shows that they passed earlier or later, so the only explanation I have for the missing qcow2 image failure is that the qdevice test started after osd had already cleaned the asset from storage.

The 12 failures amount to a 2.4% failure rate. This percentage increases to 7.8% if removing the skipped tests and adding the parallel failed ones.

However, if qdevice is a 2-node test, why are there so many more parallel-failed jobs than failed jobs? The explanation lies in the failed qnetd jobs that run in parallel to the qdevice nodes. Checking https://openqa.suse.de/tests?match=ha_qnetd shows 6 failures within the last 500 jobs, which accounts for an extra 12 parallel-failed jobs.
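The percentages above can be reproduced from the quoted counts (a quick sketch; the counts are the ones given in this comment at the time of writing, and note the 7.8% figure comes out of keeping all 500 jobs in the denominator, since excluding the 6 skipped ones would give 39/494, roughly 7.9%):

```python
# Counts for the last 500 ha_qdevice jobs as quoted above
passed, softfailed, failed = 343, 109, 12
skipped, incomplete, parallel_failed = 6, 3, 27

fail_rate = failed / 500 * 100
print(f"{fail_rate:.1f}%")  # 2.4%

# Counting parallel-failed jobs as failures as well
broader_rate = (failed + parallel_failed) / 500 * 100
print(f"{broader_rate:.1f}%")  # 7.8%
```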

But then, when we get to why these tests failed, we see that out of 1000 qdevice/qnetd jobs:

  • 18 failures.
  • 8 of those due to screen rendering --> Not related to the tests themselves.
  • 4 possibly due to a product bug
  • 1 due to worker performance --> Not related to the test itself.
  • 2 due to support server network connectivity issues --> Not related to the test themselves.
  • 1 due to failed connection to IBS --> Could be due to a test setting, QAM bot test scheduling, or IBS repository not present.
  • 2 due to HA dependencies not starting --> Could be product issue. Could be worker performance.

Not a single failure related to this ticket. And more than that, not enough grounds for removing these tests from the schedule.

No idea why the focus on the qdevice/qnetd tests when this issue was opened for several different scenarios.

I also insist that decreasing test coverage should not be the approach, but I will refrain from beating this dead horse.

#32 Updated by okurz 7 months ago

acarvajal wrote:

[…]

I appreciate your thorough analysis, and all of it up to this point is completely correctly evaluated. But we should still react to the problem at hand: QE-SAP and/or HA tests sometimes "randomly fail", e.g. due to the reported "spontaneous reboot issues". The consequence is that qa-maintenance/bot-ng will not auto-approve the corresponding SLE maintenance updates, and the group of openQA maintenance test reviewers, who should merely "coordinate", are asked for help to move those SLE maintenance updates forward. 1. These test reviewers should not even need to do that job, because the team qe-sap should do it (see https://confluence.suse.com/display/qasle/openQA+QE+Maintenance+Review ), and 2. the openQA maintenance test reviewers do not have the capacity to fix all the different test instabilities themselves. So, based on the individual issues you identified, let me provide some specific questions or suggestions:

So, out of 1000 jobs qdevice/qnetd tests:

  • 18 failures.
  • 8 of those due to screen rendering --> Not related to the tests themselves.

even if that is the case. Is there a ticket to improve that situation?

  • 4 possibly due to a product bug

so where is the follow-up for that? Just the job that was mentioned by openqa-review, https://openqa.suse.de/tests/7393110#step/check_after_reboot/15 , shows that a service fails to start up. If all cases were like these then we should be good. But this only works if everybody sees the test failures the same way.

  • 1 due to worker performance --> Not related to the test itself.

ok, but which ticket covers this then? Commonly people say "worker performance issue" when tests apply too strict timeouts which need to be changed on the test level. If you find issues where you really think that it's such a bad performance that test changes won't help then please let us know about the specific cases so that we can look into them.

  • 2 due to support server network connectivity issues --> Not related to the test themselves.

but which ticket then?

  • 1 due to failed connection to IBS --> Could be due to a test setting, QAM bot test scheduling, or IBS repository not present.

in that case tests can still be improved with better retrying
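A generic retry-with-backoff helper, as a sketch of what "better retrying" around a flaky repository connection could look like (illustrative only; in the actual os-autoinst test code this would be done in Perl around the failing zypper/curl call, and the helper below is a made-up name):

```python
import time

def retry(action, attempts=3, delay=2.0, backoff=2.0, exceptions=(Exception,)):
    """Run `action` up to `attempts` times, sleeping between tries and
    multiplying the delay each time. Re-raises the last error on exhaustion."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay)
            delay *= backoff

# Example: a flaky operation that fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("repo temporarily unreachable")
    return "ok"

print(retry(flaky, attempts=5, delay=0.01))  # ok
```

The point is that a transient IBS outage then costs a few seconds instead of a failed (and manually restarted) multi-machine job.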

[…]

Not a single failure related to this ticket. And more than that, not enough grounds for removing these tests from the schedule.

No idea why the focus on the qdevice/qnetd tests when this issue was opened for several different scenarios.

I came back to this ticket for two reasons: first, because https://progress.opensuse.org/issues/95458#note-29 points to a test failing due to the "test issue" described in this ticket (even though this might not be true, the test is labeled like that). And second, because the due date was exceeded by more than ten days, violating https://progress.opensuse.org/projects/openqatests/wiki#SLOs-service-level-objectives , hence I noticed it.

I also insist that decreasing test coverage should not be the approach, but I will refrain from beating this dead horse.

Well, removing the test (temporarily) until fixed should only be seen as last resort. And to be honest: I repeatedly mention that option as the way I would go to spawn some motivation to fix it ;)

#33 Updated by acarvajal 7 months ago

okurz wrote:

  1. These test reviewers should not even need to do that job because the team qe-sap should do it (see https://confluence.suse.com/display/qasle/openQA+QE+Maintenance+Review ) and 2. The openQA maintenance test reviewers do not have the capacity to fix all the different test instabilities themselves.

No disagreements there. What I disagree on - and what I have repeated more times than I care to count - is that the solution should never be dropping coverage. Even if, as you say, "randomly failing" tests impact automatic reviews that much, it is a practice that leads to a false sense of security. And even when the practice is followed, it can lead to unacceptable drops in coverage (for example, see https://gitlab.suse.de/qa-maintenance/qam-openqa-yml/-/merge_requests/150 & https://progress.opensuse.org/issues/68932).

So based on which individual issues you identified let me provide some specific questions or suggestions:

OK. Feels like you're dodging my argument which was "why remove from the schedule a test that has a success rate of over 90%?", but let's go ahead.

So, out of 1000 jobs qdevice/qnetd tests:

  • 18 failures.
  • 8 of those due to screen rendering --> Not related to the tests themselves.

even if that is the case. Is there a ticket to improve that situation?

No idea. Is it?

  • 4 possibly due to a product bug

so where is the follow-up for that? Just the job that was mentioned by openqa-review https://openqa.suse.de/tests/7393110#step/check_after_reboot/15 shows that a service fails to start up. If all cases would be like these then we should be be good. But this only works if everybody sees the test failures the same way.

Huh? I did mention https://openqa.suse.de/tests/7393110#step/check_after_reboot/15 below.

Is there a follow up to the 4 tests failing due to a product bug? Cannot say. First time I saw these failures was yesterday, and from what I could see, later tests on those scenarios passed.

Should I open a bug for an issue that's already fixed?

  • 1 due to worker performance --> Not related to the test itself.

ok, but which ticket covers this then? Commonly people say "worker performance issue" when tests apply too strict timeouts which need to be changed on the test level. If you find issues where you really think that it's such a bad performance that test changes won't help then please let us know about the specific cases so that we can look into them.

"Test apply too strict timeouts"???? Give me a break!

If the response to any issue presented is always going to be "perhaps you have a wrong setting", "perhaps there is a problem in the test code" or "perhaps you were too aggressive with a timeout", then we will continue to ignore potential issues.

In this particular case:

Command waited for 2 whole minutes before failing. Is 2 minutes too aggressive? I do not think so.

I grant that I may have misspoken when claiming "slow worker" though. For that command, the issue could also be network-related. Sadly there is no way to tell from the failure, and of course the same test in the same scenario shows passing results in later runs (several, in fact): https://openqa.suse.de/tests/7549015

  • 2 due to support server network connectivity issues --> Not related to the test themselves.

but which ticket then?

https://progress.opensuse.org/issues/95788

  • 1 due to failed connection to IBS --> Could be due to a test setting, QAM bot test scheduling, or IBS repository not present.

in that case tests can still be improved with better retrying

Agree.

Well, removing the test (temporarily) until fixed should only be seen as last resort. And to be honest: I repeatedly mention that option as the way I would go to spawn some motivation to fix it ;)

Understood. I hope you succeed and manage to mobilize all the teams required to get to a solution. From my experience, even if QE-SAP are the experts on these scenarios, these random failures usually fall outside of QE-SAP's field of expertise, so I do agree and believe that a coordinated effort is required.

#34 Updated by rbranco 7 months ago

  • Due date changed from 2021-10-15 to 2022-01-31

#35 Updated by openqa_review 6 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_beta_node02
https://openqa.suse.de/tests/7671905

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

#36 Updated by openqa_review 6 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_hawk_haproxy_node01
https://openqa.suse.de/tests/7749089

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

#37 Updated by openqa_review 5 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_ctdb_node01
https://openqa.suse.de/tests/7829513

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

#38 Updated by openqa_review 5 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_ctdb_node02
https://openqa.suse.de/tests/7871807

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

#39 Updated by openqa_review 4 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_diskless_sbd_qdevice_node1
https://openqa.suse.de/tests/7976342

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

#40 Updated by jkohoutek 4 months ago

Today I was looking at this again while solving issues with this: https://openqa.suse.de/tests/8012986#step/check_logs/2

From my observation it looks like ALL jobs whose running time reaches 2h fail, but the faster ones around 1h succeed. Between those it's random, but they also usually succeed: https://openqa.suse.de/tests/8005979#next_previous

The question is why the same update once took almost 2 hours and failed:

 check_logs   :22413:saptune  3 days ago ( 01:55 hours ) 

but a day later it took just 1 hour and succeeded:

  :22413:saptune  2 days ago ( 01:10 hours ) 

#41 Updated by rbranco 3 months ago

  • Due date changed from 2022-01-31 to 2022-04-30

#42 Updated by openqa_review 3 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_rolling_upgrade_migration_node01
https://openqa.suse.de/tests/8192134

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

#44 Updated by okurz 2 months ago

I can confirm. So far this looks good. openqa-query-for-job-label poo#95458 shows

2201470|2022-02-22 04:16:29|done|failed|container-host-microos||openqaworker7
8318522|2022-03-13 03:51:09|done|failed|qam_ha_rolling_update_node01||openqaworker9
8306778|2022-03-10 14:14:56|done|failed|ha_beta_node02||malbec
8296656|2022-03-09 17:30:46|done|failed|ha_beta_node02||QA-Power8-5-kvm
8297435|2022-03-09 12:45:09|done|failed|migration_offline_dvd_verify_sle15sp1_ltss_ha_alpha_node01||openqaworker3
8297914|2022-03-09 10:33:55|done|failed|ha_hawk_haproxy_node02||openqaworker6
8297846|2022-03-09 10:07:52|done|failed|ha_alpha_node01||QA-Power8-5-kvm
8292351|2022-03-09 05:02:30|done|failed|ha_qdevice_node2||QA-Power8-5-kvm
8293695|2022-03-09 04:59:57|done|failed|ha_qdevice_node2||openqaworker-arm-1
8293662|2022-03-09 04:38:34|done|failed|ha_ctdb_node02||openqaworker-arm-2
8293690|2022-03-09 03:52:44|done|failed|ha_priority_fencing_node01||openqaworker-arm-3

so the latest failures were on 2022-03-09
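Summarizing the pipe-separated listing above can be done mechanically. A small sketch, assuming the field layout shown (job id|timestamp|state|result|test||worker; the helper name is illustrative):

```shell
# Hypothetical helper: given openqa-query-for-job-label-style lines on stdin,
# print the most recent failure date (second field holds the timestamp).
latest_failure_date() {
    cut -d'|' -f2 | cut -d' ' -f1 | sort -r | head -n1
}

# Example with two lines from the listing above:
printf '%s\n' \
  '8306778|2022-03-10 14:14:56|done|failed|ha_beta_node02||malbec' \
  '8296656|2022-03-09 17:30:46|done|failed|ha_beta_node02||QA-Power8-5-kvm' \
  | latest_failure_date
```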

#45 Updated by openqa_review about 2 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_priority_fencing_node01
https://openqa.suse.de/tests/8439544#step/iscsi_client/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

#46 Updated by openqa_review 4 days ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_delta_node02
https://openqa.suse.de/tests/8741156#step/iscsi_client/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 60 days if nothing changes in this ticket.
