action #95458
open
[qe-sap][ha] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry
Description
Observation
openQA tests in HA scenarios (2- or 3-node clusters) fail in different modules due to unexpected reboots in one or more of the SUTs:
- QAM HA qdevice node 1, fails in ha_cluster_init module
- QAM HA rolling upgrade migration, node 2, fails in filesystem module
- QAM HA hawk/HAProxy node 1, fails in check_after_reboot module
- QAM 2 nodes, node 1, fails in ha_cluster_init module
Test suite description
The base test suite is used for job templates defined in YAML documents. It has no settings of its own.
Reproducible
The issue is very sporadic, and reproducing it is not always possible. Usually, re-triggering the jobs leads to the tests passing.
For example, from the jobs above, re-triggered jobs succeeded:
- https://openqa.suse.de/tests/6435165
- https://openqa.suse.de/tests/6435380
- https://openqa.suse.de/tests/6435384
- https://openqa.suse.de/tests/6435389
Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label,
e.g. call openqa-query-for-job-label poo#95458, as sketched below.
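A minimal sketch of fetching and running the helper, assuming a Unix shell with curl available (the script may need additional access, e.g. SSH to the openQA host, depending on its implementation):
# Rough sketch (not from the ticket): fetch the helper script and query OSD
# for jobs labeled with this ticket.
curl -sO https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label
chmod +x openqa-query-for-job-label
./openqa-query-for-job-label poo#95458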
Expected result
- Last good: :20349:samba (or more recent)
- Last good: :20121:crmsh
- Last good: 20321:kernel-ec2
- Last good: MR:244261:crmsh
Updated by maritawerner over 3 years ago
- Subject changed from SUT reboots unexpectedly, leading to tests failing in HA scenarios to [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios
Updated by szarate over 3 years ago
- Project changed from openQA Tests (public) to openQA Project (public)
- Category changed from Bugs in existing tests to Regressions/Crashes
Another example can be: https://openqa.suse.de/tests/6410169#step/filesystem/28
After checking the logs and cross-referencing with the system journal, the only rough hint I get is:
Jul 11 01:36:26 openqaworker5 kernel: kvm [38069]: vcpu0, guest rIP: 0xffffffff924776b8 disabled perfctr wrmsr: 0xc2 data 0xffff
which corresponds more or less to the last time the test ran one of those commands... I see no coredumps whatsoever... but that message is a bit puzzling (it repeats every now and then on the worker too, but other jobs apparently don't have the problem)
PS: Moved to the openQA project for now, although I'm torn between infrastructure and this project itself
Updated by MDoucha over 3 years ago
- Project changed from openQA Project (public) to openQA Tests (public)
- Category deleted (Regressions/Crashes)
My first guess would be that the test somehow gets into a kernel panic. Add the ignore_level kernel command line parameter to grub.cfg during incident installation to see kernel backtraces in serial console logs. Here's an example of how LTP adds the ignore_level kernel parameter:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/kernel/install_ltp.pm#L353
Updated by MDoucha over 3 years ago
- Project changed from openQA Tests (public) to openQA Project (public)
- Category set to Regressions/Crashes
Oops, sorry for overwriting some metadata.
Updated by okurz over 3 years ago
- Category changed from Regressions/Crashes to Support
- Status changed from New to Feedback
- Assignee set to okurz
- Target version set to Ready
Hm, if there were a kernel panic, the serial log should show at least something. But the system acts like it went through a forced power reset. https://openqa.suse.de/tests/6410169/logfile?filename=serial0.txt mentions "Wf8kQ-0-" as the last token before the next command is executed, but there is nothing after that token in the serial log.
I also manually checked the video from https://openqa.suse.de/tests/6410169/file/video.ogv and stepped through the frames one by one and have not found anything between the healthy bash session like in https://openqa.suse.de/tests/6410169#step/filesystem/28 and the grub menu on boot like https://openqa.suse.de/tests/6410169#step/filesystem/29
According to https://bugs.centos.org/view.php?id=6730 and https://bugzilla.redhat.com/show_bug.cgi?id=507085 messages like "kvm: vcpu0, guest rIP disabled perfctr wrmsr" are considered harmless. I doubt they are related to the problems we see.
@acarvajal I suggest you try the kernel command line parameters that mdoucha suggested. Also you could try if QEMUCPU=host makes any difference, just a hunch :)
EDIT: If you would like others to pick this up I suggest you try to come up with "steps to reproduce", e.g. an openqa-cli api -X post isos command line to trigger a safe set of jobs that does not interfere with production for cross-checking; a rough sketch of such a call is below. Then we could potentially also ask someone else from the tools team to take over.
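For illustration, a sketch of what such a call could look like; every product and test value below is a placeholder and would have to be replaced with a medium and test suites that actually exist on OSD:
# Hypothetical scheduling call for a dedicated statistical investigation run.
# DISTRI/VERSION/FLAVOR/ARCH/BUILD/TEST are placeholders; a distinct BUILD
# keeps these jobs apart from production results.
openqa-cli api --host https://openqa.suse.de -X post isos \
  DISTRI=sle VERSION=15-SP3 FLAVOR=Server-DVD-HA-Updates ARCH=x86_64 \
  BUILD=poo95458_investigation \
  TEST=ha_qdevice_node1,ha_qdevice_node2,ha_qnetd,ha_supportserver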
Updated by acarvajal over 3 years ago
okurz wrote:
@acarvajal I suggest you try the kernel command line parameters that mdoucha suggested. Also you could try if QEMUCPU=host makes any difference, just a hunch :)
I have added QEMUCPU=host in the tests that run in the HANA validation openQA instance. Last time there were 3 failures with this issue out of 200+ jobs, so I guess it is a good place to try. I should see some results in the next runs on Monday.
BTW, interesting hunch. I am not seeing this issue on Power9 (100+ jobs in that same openQA instance), which made me think that whatever's causing it could be related to qemu. I'll come back with some results next Monday.
I will also begin planning to introduce mdoucha's suggestions to gather more logs for the tests in osd.
Updated by okurz over 3 years ago
acarvajal wrote:
okurz wrote:
@acarvajal I suggest you try the kernel command line parameters that mdoucha suggested. Also you could try if QEMUCPU=host makes any difference, just a hunch :)
I have added QEMUCPU=host in the tests that run in the HANA validation openQA instance. Last time there were 3 failures with this issue out of 200+ jobs, so I guess it is a good place to try. I should see some results in the next runs on Monday.
What I thought of is to not change the production tests but rather trigger an additional, dedicated test set, e.g. following https://progress.opensuse.org/projects/openqatests/wiki/Wiki#Statistical-investigation and after the weekend look at the corresponding test overview page to quickly get the overview of how many tests passed/failed.
Updated by acarvajal over 3 years ago
I had to restart 2 tests with the failure yesterday in the HANA validation openQA instance.
So the difference between using QEMUCPU=host or not was 3 out of 200+ last week vs. 2 out of 200+ this week. I don't think this is statistically significant, and the bad news is that the issue is still present.
I will look into implementing mdoucha's suggestions and triggering some additional jobs with and without QEMUCPU=host to do a more thorough analysis, e.g. along the lines of the sketch below.
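For reference, a sketch of how such comparison jobs could be triggered by cloning one of the failed jobs with and without the setting; the job ID and BUILD tags below are placeholders, not an agreed plan:
# Sketch only: clone an existing HA job twice, tagging each clone with a
# distinct BUILD so the results stay separate from production statistics.
openqa-clone-job --within-instance https://openqa.suse.de 6410169 \
  QEMUCPU=host BUILD=poo95458_qemucpu_host
openqa-clone-job --within-instance https://openqa.suse.de 6410169 \
  BUILD=poo95458_baseline
# For multi-machine scenarios the whole parallel cluster (support server and
# the other node) has to be cloned along with the job.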
Updated by okurz over 3 years ago
- Subject changed from [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios to [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha/filesystem.pm:81.*command.*mkfs -t ocfs2.*timed out":retry
I have seen the symptom
[2021-07-11T01:38:13.388 CEST] [info] ::: basetest::runtest: # Test died: command 'mkfs -t ocfs2 -F -N 16 "/dev/vg_cluster_md/lv_openqa"' timed out at /usr/lib/os-autoinst/testapi.pm line 959.
likely more than once – or I keep getting back to the same job ;) – still, I am trying with auto-review to automatically detect such cases and retrigger the affected tests. The regex should be specific enough to only catch HA tests, but at any time it can be generalized or extended to cover alternative symptoms. Quoting @acarvajal: "hope you can find an easy way to detect those cases .... it's not easy as it can happen anytime and in any module during the test" – but we should start somewhere :)
Updated by acarvajal over 3 years ago
okurz wrote:
I have seen the symptom
[2021-07-11T01:38:13.388 CEST] [info] ::: basetest::runtest: # Test died: command 'mkfs -t ocfs2 -F -N 16 "/dev/vg_cluster_md/lv_openqa"' timed out at /usr/lib/os-autoinst/testapi.pm line 959.
likely more than once – or I keep getting back to the same job ;) – still, I am trying with auto-review to automatically detect such cases and retrigger the affected tests. The regex should be specific enough to only catch HA tests, but at any time it can be generalized or extended to cover alternative symptoms. Quoting @acarvajal: "hope you can find an easy way to detect those cases .... it's not easy as it can happen anytime and in any module during the test" – but we should start somewhere :)
EDIT: Tested with
echo https://openqa.suse.de/tests/6410236 | env host=openqa.suse.de openqa-investigate
with openqa-investigate from github.com/os-autoinst/scripts/. But openqa-investigate currently does not support cloning jobs that are part of a multi-machine cluster, see https://github.com/os-autoinst/scripts/commit/371467dafcefb9182530c790c33632f8cfa9a297#diff-f73cf39a07f6cf8cdb453862496919d06df16d07e58b274e68ea148dd1f7dae5
That's one of the symptoms.
I'd say whenever the SUT is unexpectedly rebooted, tests will fail in one of two ways depending on what the test module was doing:
- A timeout in an assert_script_run (or similar), such as the symptom in this filesystem test module failure.
- A failure in assert_screen such as in https://openqa.suse.de/tests/6426340#step/check_after_reboot/5
Since the majority of the HA test modules rely either on the root_console or the serial terminal, I think the first case will be more common, but I don't know if having a general rule to restart tests when commands time out is safe.
Updated by okurz over 3 years ago
- Subject changed from [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha/filesystem.pm:81.*command.*mkfs -t ocfs2.*timed out":retry to [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*command.*mkfs -t ocfs2.*timed out.*match=root-console timed out":retry
Updated by okurz over 3 years ago
Tested the auto-review regex with
$ echo https://openqa.suse.de/tests/6410225 | env dry_run=1 host=openqa.suse.de ./openqa-label-known-issues
openqa-cli api --host https://openqa.suse.de -X POST jobs/6410225/comments text=poo#95458 [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*command.*mkfs -t ocfs2.*timed out.*match=root-console timed out":retry
openqa-cli api --host https://openqa.suse.de -X POST jobs/6410225/restart
so a comment would have been written and the test should have been restarted, assuming this works this way over the API for multi-machine clusters.
Updated by okurz over 3 years ago
- Subject changed from [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*command.*mkfs -t ocfs2.*timed out.*match=root-console timed out":retry to [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry
Updated by okurz over 3 years ago
- Related to action #95788: [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network added
Updated by okurz over 3 years ago
- Project changed from openQA Project (public) to openQA Tests (public)
- Subject changed from [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry to [ha] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry
- Category changed from Support to Bugs in existing tests
- Status changed from Feedback to Workable
- Assignee changed from okurz to acarvajal
The auto-review regex matching might be a bit too broad as it also catches issues like https://openqa.suse.de/tests/6581119#step/iscsi_client/18 where the test fails in iscsi and then also the post_fail_hook fails to select a free root terminal. However, this is all within the scope of tests/ha so I will leave this to you again. As a follow-up to #95458#note-2: sorry, I don't see how this is a problem with openQA itself.
Updated by MDoucha over 3 years ago
MDoucha wrote:
My first guess would be that the test somehow gets into a kernel panic. Add the ignore_level kernel command line parameter to grub.cfg during incident installation to see kernel backtraces in serial console logs. Here's an example of how LTP adds the ignore_level kernel parameter:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/kernel/install_ltp.pm#L353
Correction of my suggestion: the kernel command line parameter is actually ignore_loglevel. Also updating the link above to a permalink:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/49342da7528b6bc0a8b418090487bc40c7f8e4ce/tests/kernel/install_ltp.pm#L359
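For HA scenarios that do not go through the LTP helper, a minimal sketch of what adding this parameter on a booted SUT could look like, assuming a standard grub2 setup as on SLE (this is not taken from the HA test code):
# Sketch: persistently add ignore_loglevel to the kernel command line
# (assumes GRUB_CMDLINE_LINUX_DEFAULT exists in /etc/default/grub).
sed -i 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*\)"/\1 ignore_loglevel"/' /etc/default/grub
grub2-mkconfig -o /boot/grub2/grub.cfg
# Verify after the next reboot:
grep -o ignore_loglevel /proc/cmdline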
Updated by okurz over 3 years ago
- Due date changed from 2021-07-30 to 2021-08-06
- Target version changed from Ready to future
@acarvajal are you ok to continue here?
Updated by okurz over 3 years ago
- Related to action #94171: [qem][sap] test fails in check_logs about 50% of times added
Updated by acarvajal over 3 years ago
- Project changed from openQA Tests (public) to openQA Project (public)
- Due date changed from 2021-08-06 to 2021-09-10
- Category deleted (Bugs in existing tests)
- Status changed from Workable to Feedback
- Assignee changed from acarvajal to okurz
- Target version changed from future to Ready
okurz wrote:
@acarvajal are you ok to continue here?
Yes. I think so. I will probably sync with you before doing so though.
Updated by acarvajal over 3 years ago
- Project changed from openQA Project (public) to openQA Tests (public)
- Category set to Bugs in existing tests
- Status changed from Feedback to Workable
- Assignee changed from okurz to acarvajal
- Target version changed from Ready to future
Updated by szarate over 3 years ago
- Related to action #97013: [qe-core][qe-yast] test fails in handle_reboot, patch_and_reboot, installation added
Updated by openqa_review over 3 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: migration_offline_scc_verify_sle15sp1_ltss_ha_alpha_node02
https://openqa.suse.de/tests/6958789
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The label in the openQA scenario is removed
Updated by okurz over 3 years ago
- Subject changed from [ha] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry to [qe-sap][ha] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry
Using keyword "qe-sap" as verified by jmichel in weekly QE sync 2021-09-15
Updated by okurz over 3 years ago
this ticket is exceeding its due-date. It popped up during the weekly QE sync 2021-09-22. We would appreciate a reaction within the next days, at least updating the due-date according to what we can realistically expect. See https://progress.opensuse.org/projects/openqatests/wiki#SLOs-service-level-objectives for details
Updated by acarvajal over 3 years ago
- Due date changed from 2021-09-10 to 2021-10-15
Updated by openqa_review about 3 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: ha_qdevice_node2
https://openqa.suse.de/tests/7350101
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Updated by openqa_review about 3 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: ha_qdevice_node2
https://openqa.suse.de/tests/7393110
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Updated by okurz about 3 years ago
So we discussed in a meeting. Thank you for inviting me :)
As this is a ticket about the tests themselves I suggest that you within qe-sap drive the work. From the tools team we will be able to provide support and stand ready for collaboration. If you don't plan to have the issue fixed soon, the affected tests can also be unscheduled according to the QAM processes until (hopefully) the tests can eventually be brought back. One additional observation from my side: Some months ago we still had highly stable multi-machine tests for the areas "network" as well as "HPC" and "SAP", but it seems all three areas have not received a lot of love. There are multiple other areas where multi-machine tests are running just fine, so I am not aware of any generic openQA problems. The issues I see are limited to the aforementioned areas, hence it's unlikely they will go away until explicitly addressed, because they are domain-specific. Based on my former experience, those issues could very well point to valid product regressions.
And https://openqa.suse.de/tests?match=wicked_ shows that openQA multi-machine tests relying on the network can work very reliably; no network-related problem shows up there. A quick SQL query revealed that the fail ratio of "wicked" tests within the last 10 days is 0.6%, so very low. This shows that openQA multi-machine tests can be very stable and also that we don't have a generic problem in our tooling or infrastructure. Regardless of whether these are sporadic product issues or test design flaws, at this point I recommend focusing on mitigating the negative impact on the openQA review procedures in the short term and removing the affected tests from the schedule. Also see https://confluence.suse.com/display/openqa/QAM+openQA+review+guide as reference. During the time the tests are not within the validation schedule, QE-SAP would need to ensure by other means that the product quality is sufficient.
@acarvajal does it make sense for you to stay assigned to this ticket?
@rbranco as direct followup to what we just discussed, have you seen #95458#note-29 ? I suggest to remove ha_qdevice_node2 and all related test suites from the validation job groups until the problem can be fixed. Can you create an according merge request for the test schedule in https://gitlab.suse.de/qa-maintenance/qam-openqa-yml ?
Updated by acarvajal about 3 years ago
- Assignee changed from acarvajal to rbranco
okurz wrote:
And https://openqa.suse.de/tests?match=wicked_ shows that openQA multi-machine tests relying on the network can work very reliably. No network related problem showing up there.
This issue is definitely not network-related. 100% agreement with you there. The ticket that was tracking network issues in HA QAM tests is: https://progress.opensuse.org/issues/95788
This ticket instead tracks the sporadic failure where the SUT VM reboots when the test code is not expecting it, and even though this is more frequent in MM scenarios, it can also happen in SM scenarios.
And this is not limited only to osd. In openqa.wdf.sap.corp I've noticed the same issue in:
- 9 jobs out of 384 jobs total for this week.
- 4 jobs out of 384 jobs total for the week starting on 18.10.2021.
- 4 jobs out of 384 jobs total for the week starting on 11.10.2021.
- 12 jobs out of 384 jobs total for the week starting on 4.10.2021
All these jobs have been running with QEMUCPU=host since it was suggested in https://progress.opensuse.org/issues/95458#note-5, but even without that setting the failure rate was more or less the same. It is a low failure rate though, at around 1.8%.
Restarting all these tests results in them passing. Hence no other assumption to make except that a transient race condition is to blame for these failures, and not something related to the tests themselves.
A quick SQL query revealed that for the fail ratio of "wicked" tests within the last 10 days is 0.6% so very low.
What would be the fail ratio of "qdevice" and "qnetd" in the same period? Can you point me to where/how I can get that data myself?
@acarvajal does it make sense for you to stay assigned to this ticket?
I do not think so. I asked @jmichel who should I assign it to, so I am assigning this to @rbranco.
@rbranco as direct followup to what we just discussed, have you seen #95458#note-29 ? I suggest to remove ha_qdevice_node2 and all related test suites from the validation job groups until the problem can be fixed. Can you create an according merge request for the test schedule in https://gitlab.suse.de/qa-maintenance/qam-openqa-yml ?
Following up on the "wicked" exercise from above, if we go to: https://openqa.suse.de/tests?match=ha_qdevice
We'll see that out of the last 500 qdevice jobs finished in osd, there are at the time of this writing:
343 passed, 109 softfailed, 12 failed, 6 skipped, 3 incomplete and 27 parallel failed
All 3 incompletes are failures while downloading the qcow2 image. Checking the Next & Previous tab in all these 3 tests shows that these have passed earlier or later, so the only explanation that I have for the missing qcow2 image failure is that the qdevice test started after osd had already cleaned the asset from the storage.
The 12 failures amount to a 2.4% failure rate. This percentage increases to 7.8% if removing the skipped tests and adding the parallel failed ones.
However, if qdevice is a 2-node test, why are there so many more parallel failed jobs than failed jobs? The explanation lies in the failed qnetd jobs that run in parallel to the qdevice nodes. Checking https://openqa.suse.de/tests?match=ha_qnetd shows 6 failures within the last 500 jobs, which accounts for an extra 12 parallel failed jobs.
But then when we get to why these tests failed, we see that:
- Of the 6 qnetd failures, 5 were due to issues in the screen: https://openqa.suse.de/tests/7507523, https://openqa.suse.de/tests/7446239, https://openqa.suse.de/tests/7443590, https://openqa.suse.de/tests/7439788, https://openqa.suse.de/tests/7387987.
- The other one failed connecting to IBS repositories: https://openqa.suse.de/tests/7454688#step/qnetd/28
- Of the 12 qdevice failures, 3 were also due to issues in the screen: https://openqa.suse.de/tests/7444348, https://openqa.suse.de/tests/7448601, https://openqa.suse.de/tests/7450384
- 4 seem to be due to a product bug ... a missing binary in the image: https://openqa.suse.de/tests/7519115#step/iscsi_client/13, https://openqa.suse.de/tests/7518128#step/iscsi_client/13, https://openqa.suse.de/tests/7511904#step/iscsi_client/13, https://openqa.suse.de/tests/7511001#step/iscsi_client/13
- 2 due to connectivity issues with the support server network: https://openqa.suse.de/tests/7478390#step/qnetd/22, https://openqa.suse.de/tests/7460365#step/ha_cluster_join/11
- 1 due to a slow worker: https://openqa.suse.de/tests/7456662#step/ha_cluster_join/13
- The other 2 due to an HA dependency not starting: https://openqa.suse.de/tests/7509152#step/check_after_reboot/15, https://openqa.suse.de/tests/7451349#step/ha_cluster_init/17
So, out of 1000 jobs qdevice/qnetd tests:
- 18 failures.
- 8 of those due to screen rendering --> Not related to the tests themselves.
- 4 possibly due to a product bug
- 1 due to worker performance --> Not related to the test itself.
- 2 due to support server network connectivity issues --> Not related to the tests themselves.
- 1 due to failed connection to IBS --> Could be due to a test setting, QAM bot test scheduling, or IBS repository not present.
- 2 due to HA dependencies not starting --> Could be product issue. Could be worker performance.
Not a single failure related to this ticket. And more than that, not enough grounds for removing these tests from the schedule.
No idea why the focus on the qdevice/qnetd tests when this issue was opened for several different scenarios.
I also insist that decreasing test coverage should not be the approach, but I will refrain from beating this dead horse.
Updated by okurz about 3 years ago
acarvajal wrote:
[…]
I appreciate your thorough analysis, and all of it up to this point is evaluated completely correctly. But we still should react to the problem at hand: QE-SAP and/or HA tests sometimes "randomly fail", e.g. due to the reported "spontaneous reboot issues". The consequence is that qa-maintenance/bot-ng will not auto-approve the corresponding SLE maintenance updates, and the group of openQA maintenance test reviewers that should merely "coordinate" is asked for help to move those SLE maintenance updates forward. 1. These test reviewers should not even need to do that job because the team qe-sap should do it (see https://confluence.suse.com/display/qasle/openQA+QE+Maintenance+Review ) and 2. the openQA maintenance test reviewers do not have the capacity to fix all the different test instabilities themselves. So, based on the individual issues you identified, let me provide some specific questions or suggestions:
So, out of 1000 jobs qdevice/qnetd tests:
- 18 failures.
- 8 of those due to screen rendering --> Not related to the tests themselves.
Even if that is the case, is there a ticket to improve that situation?
- 4 possibly due to a product bug
So where is the follow-up for that? Just the job that was mentioned by openqa-review, https://openqa.suse.de/tests/7393110#step/check_after_reboot/15, shows that a service fails to start up. If all cases were like these then we should be good. But this only works if everybody sees the test failures the same way.
- 1 due to worker performance --> Not related to the test itself.
OK, but which ticket covers this then? Commonly people say "worker performance issue" when tests apply overly strict timeouts which need to be changed on the test level. If you find issues where you really think the performance is so bad that test changes won't help, then please let us know about the specific cases so that we can look into them.
- 2 due to support server network connectivity issues --> Not related to the test themselves.
but which ticket then?
- 1 due to failed connection to IBS --> Could be due to a test setting, QAM bot test scheduling, or IBS repository not present.
in that case tests can still be improved with better retrying
[…]
Not a single failure related to this ticket. And more than that, not enough grounds for removing these tests from the schedule.
No idea why the focus on the qdevice/qnetd tests when this issue was opened for several different scenarios.
I came back to this ticket for two reasons: first, because there is https://progress.opensuse.org/issues/95458#note-29 pointing to a test failing due to the "test issue" described in this ticket (even though this might not be true, the test is labeled like that); and second, because the due-date was exceeded by more than ten days, violating https://progress.opensuse.org/projects/openqatests/wiki#SLOs-service-level-objectives, hence I noticed it.
I also insist that decreasing test coverage should not be the approach, but I will refrain from beating this dead horse.
Well, removing the test (temporarily) until fixed should only be seen as a last resort. And to be honest: I repeatedly mention that option as the way I would go to spawn some motivation to fix it ;)
Updated by acarvajal about 3 years ago
okurz wrote:
- These test reviewers should not even need to do that job because the team qe-sap should do it (see https://confluence.suse.com/display/qasle/openQA+QE+Maintenance+Review ) and 2. The openQA maintenance test reviewers do not have the capacity to fix all the different test instabilities themselves.
No disagreements there. What I disagree with - and what I have repeated more times than I care to count - is dropping coverage as the solution. Even if, as you say, "randomly failing" tests impact automatic reviews so much, dropping them is a practice that leads to a false sense of security. Even when the practice is followed, it can lead to unacceptable drops in coverage (for example, see https://gitlab.suse.de/qa-maintenance/qam-openqa-yml/-/merge_requests/150 & https://progress.opensuse.org/issues/68932).
So based on which individual issues you identified let me provide some specific questions or suggestions:
OK. Feels like you're dodging my argument which was "why remove from the schedule a test that has a success rate of over 90%?", but let's go ahead.
So, out of 1000 jobs qdevice/qnetd tests:
- 18 failures.
- 8 of those due to screen rendering --> Not related to the tests themselves.
even if that is the case. Is there a ticket to improve that situation?
No idea. Is there?
- 4 possibly due to a product bug
so where is the follow-up for that? Just the job that was mentioned by openqa-review https://openqa.suse.de/tests/7393110#step/check_after_reboot/15 shows that a service fails to start up. If all cases would be like these then we should be be good. But this only works if everybody sees the test failures the same way.
Huh? I did mention https://openqa.suse.de/tests/7393110#step/check_after_reboot/15 below.
Is there a follow up to the 4 tests failing due to a product bug? Cannot say. First time I saw these failures was yesterday, and from what I could see, later tests on those scenarios passed.
Should I open a bug for an issue that's already fixed?
- 1 due to worker performance --> Not related to the test itself.
ok, but which ticket covers this then? Commonly people say "worker performance issue" when tests apply too strict timeouts which need to be changed on the test level. If you find issues where you really think that it's such a bad performance that test changes won't help then please let us know about the specific cases so that we can look into them.
"Test apply too strict timeouts"???? Give me a break!
If the response to any issue presented is always going to be "perhaps you have a wrong setting", "perhaps there is a problem in the test code" or "perhaps you were too aggressive with a timeout", then we will continue to ignore potential issues.
In this particular case:
- https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/lib/hacluster.pm#L91
- "TIMEOUT_SCALE" : 2, from https://openqa.suse.de/tests/7456662/file/vars.json
The command waited for 2 whole minutes before failing. Is 2 minutes too aggressive? I do not think so.
I grant that I may have misspoken when claiming "slow worker" though. On that command, the issue could also be network-related. Sadly there is no way to tell from the failure, and of course the same test in the same scenario in a later run (several, in fact) shows passing results: https://openqa.suse.de/tests/7549015
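As a side note, the scaling factor actually applied to a given job can be double-checked from its vars.json (a quick sketch assuming curl and jq are available locally):
# Print the TIMEOUT_SCALE of the job discussed above; jq prints null if the
# variable is not set for that job.
curl -s https://openqa.suse.de/tests/7456662/file/vars.json | jq '.TIMEOUT_SCALE'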
- 2 due to support server network connectivity issues --> Not related to the test themselves.
but which ticket then?
https://progress.opensuse.org/issues/95788
- 1 due to failed connection to IBS --> Could be due to a test setting, QAM bot test scheduling, or IBS repository not present.
in that case tests can still be improved with better retrying
Agree.
Well, removing the test (temporarily) until fixed should only be seen as last resort. And to be honest: I repeatedly mention that option as the way I would go to spawn some motivation to fix it ;)
Understood. I hope you succeed and manage to mobilize all the teams required to get to a solution. From my experience, even if QE-SAP are the experts on these scenarios, these random failures usually fall outside of QE-SAP's field of expertise, so I do agree and believe that a coordinated effort is required.
Updated by rbranco about 3 years ago
- Due date changed from 2021-10-15 to 2022-01-31
Updated by openqa_review about 3 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: ha_beta_node02
https://openqa.suse.de/tests/7671905
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Updated by openqa_review about 3 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: qam_ha_hawk_haproxy_node01
https://openqa.suse.de/tests/7749089
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Updated by openqa_review about 3 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: ha_ctdb_node01
https://openqa.suse.de/tests/7829513
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Updated by openqa_review about 3 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: ha_ctdb_node02
https://openqa.suse.de/tests/7871807
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Updated by openqa_review almost 3 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: ha_diskless_sbd_qdevice_node1
https://openqa.suse.de/tests/7976342
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Updated by jkohoutek almost 3 years ago
Today I was looking at this again while solving issues with this: https://openqa.suse.de/tests/8012986#step/check_logs/2
From my observation it looks like ALL jobs whose running time reaches 2h fail, while the faster ones around 1h succeed. In between it's random, but they also usually succeed: https://openqa.suse.de/tests/8005979#next_previous
The question is why the same update once took almost 2 hours and failed:
check_logs :22413:saptune 3 days ago ( 01:55 hours )
but a day later took just 1 hour and succeeded:
:22413:saptune 2 days ago ( 01:10 hours )
Updated by rbranco almost 3 years ago
- Due date changed from 2022-01-31 to 2022-04-30
Updated by openqa_review almost 3 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: qam_ha_rolling_upgrade_migration_node01
https://openqa.suse.de/tests/8192134
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Updated by rbranco almost 3 years ago
- Status changed from Workable to Resolved
Updated by okurz almost 3 years ago
I can confirm. So far this looks good. openqa-query-for-job-label poo#95458 shows
2201470|2022-02-22 04:16:29|done|failed|container-host-microos||openqaworker7
8318522|2022-03-13 03:51:09|done|failed|qam_ha_rolling_update_node01||openqaworker9
8306778|2022-03-10 14:14:56|done|failed|ha_beta_node02||malbec
8296656|2022-03-09 17:30:46|done|failed|ha_beta_node02||QA-Power8-5-kvm
8297435|2022-03-09 12:45:09|done|failed|migration_offline_dvd_verify_sle15sp1_ltss_ha_alpha_node01||openqaworker3
8297914|2022-03-09 10:33:55|done|failed|ha_hawk_haproxy_node02||openqaworker6
8297846|2022-03-09 10:07:52|done|failed|ha_alpha_node01||QA-Power8-5-kvm
8292351|2022-03-09 05:02:30|done|failed|ha_qdevice_node2||QA-Power8-5-kvm
8293695|2022-03-09 04:59:57|done|failed|ha_qdevice_node2||openqaworker-arm-1
8293662|2022-03-09 04:38:34|done|failed|ha_ctdb_node02||openqaworker-arm-2
8293690|2022-03-09 03:52:44|done|failed|ha_priority_fencing_node01||openqaworker-arm-3
so the latest failures were on 2022-03-09
Updated by openqa_review over 2 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: ha_priority_fencing_node01
https://openqa.suse.de/tests/8439544#step/iscsi_client/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.
Updated by openqa_review over 2 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: ha_delta_node02
https://openqa.suse.de/tests/8741156#step/iscsi_client/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Expect the next reminder at the earliest in 60 days if nothing changes in this ticket.
Updated by bschmidt about 2 years ago
- Status changed from Resolved to In Progress
unfortunately this happens again :-(
see https://openqa.suse.de/tests/9679509#step/check_after_reboot/30
Updated by slo-gin about 2 years ago
This ticket is 10 days after the due-date. Please consider closing this ticket or move the due-date accordingly.
Updated by rbranco about 2 years ago
I vote for closing this ticket as the issue has nothing to do with SAP/HA.
Updated by slo-gin about 2 years ago
This ticket is 10 days after the due-date. Please consider closing this ticket or move the due-date accordingly.
Updated by acarvajal about 2 years ago
rbranco wrote:
I vote for closing this ticket as the issue has nothing to do with SAP/HA.
Is the issue gone? Judging by https://progress.opensuse.org/issues/95458#note-47 it isn't.
Before closing I would vote for re-assignment.
Updated by slo-gin about 2 years ago
This ticket is 10 days after the due-date. Please consider closing this ticket or move the due-date accordingly.
Updated by rbranco about 2 years ago
acarvajal wrote:
rbranco wrote:
I vote for closing this ticket as the issue has nothing to do with SAP/HA.
Is the issue gone? Judging by https://progress.opensuse.org/issues/95458#note-47 it isn't.
Before closing I would vote for re-assignment.
This poo is too generic IMHO. Can you please reassign it to someone? I will be in squad rotation from November until February.
Updated by slo-gin about 2 years ago
This ticket is 10 days after the due-date. Please consider closing this ticket or move the due-date accordingly.
Updated by okurz about 2 years ago
- Due date deleted (2022-04-30)
This ticket had a due date set but already exceeded it by more than 14 days. We would like to take the due date seriously so please update the ticket accordingly (resolve the ticket or update the due-date or remove the due-date). See https://progress.opensuse.org/projects/openqatests/wiki/Wiki#SLOs-service-level-objectives for details.
Updated by openqa_review about 2 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: migration_offline_dvd_verify_sle12sp5_ha_alpha_node01_atmg
https://openqa.suse.de/tests/9918083#step/migrate_clvmd_to_lvmlockd/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Expect the next reminder at the earliest in 44 days if nothing changes in this ticket.
Updated by openqa_review about 2 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: migration_offline_dvd_verify_sle15sp2_ltss_ha_alpha_node02
https://openqa.suse.de/tests/10027954#step/check_after_reboot/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.
Updated by vsvecova about 2 years ago
Hello @fgerling, I'm wondering whether there is any update about this issue? It has been blocking the autoapproval for quite a few maintenance updates lately.
Updated by fgerling about 2 years ago
- Status changed from In Progress to Feedback
The last comments from Alvaro and Ricardo mention that it is not a SAP-specific issue. Is there a process to hand it over?
I requested feedback from the PO regarding priority, and will update here when I get an answer.
Updated by vsvecova about 2 years ago
I'm not sure how you define specific, but I don't recall seeing this issue anywhere other than in HA-related jobs. I'm not aware of any ticket hand-over process; I guess the rule of thumb has always been that the squad whose tests are failing is also responsible for the fix. In any case, I'm pondering the usefulness of a test that fails so often. Wouldn't it make more sense to just unschedule them?
Updated by LMartin about 2 years ago
- Assignee changed from fgerling to bschmidt
This ticket describes SUT reboots which are unexpected and sporadic, e.g. #note-31 and #note-32 from a year ago. Indeed, it was so sporadic back then that it was hard to find a reproducer.
However, looking at the recent failures from 15 SP5 in this ticket ( https://openqa.suse.de/tests/10027954#next_previous and https://openqa.suse.de/tests/10027954#next_previous ), those new failures are very frequent. So those two cases are either the reproducer which has been asked for in this ticket, or an actual product bug which needs attention.
Miura and Birger: can you please check and give feedback here on whether https://openqa.suse.de/tests/10027954#next_previous and https://openqa.suse.de/tests/10027954#next_previous are real issues or the sporadic unexpected reboots described here.
Regarding autoapprovals of maintenance updates, I have asked Ednilson Miura and Birger Schmidt from QE-SAP to keep an extra eye on http://dashboard.qam.suse.de/blocked to make sure SAP/HA tests are not unnecessarily blocking updates due to these sporadic SUT reboots, i.e. those tests should (as a workaround) be retriggered & assessed to see if they are real issues or these sporadic reboots. If you see QE-SAP blocking maintenance updates, please feel free to reach out and ask for the time being.
For a longer-term solution I need to verify with Alvaro when he returns from vacation. I need to understand whether there is a commonality between these failures, e.g. is it always migration, or some specific HA test(s), or only migration, etc.
In 2023 we can for sure migrate the tests to new workers, but based on what I read above, the resolution is probably not that simple.
And no, unscheduling tests does not make sense in my view. Fixing broken tests makes absolute sense though.
Updated by openqa_review almost 2 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: migration_offline_dvd_verify_sle15sp2_ltss_ha_alpha_node02
https://openqa.suse.de/tests/10218340#step/check_after_reboot/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.
Updated by tinita almost 2 years ago
I can see that grep sometimes times out with the regex in the title.
Since (?s) is used, every .* can span multiple lines, which involves a lot of backtracking and might not be needed. Please consider changing some of the .* to [^\n]*, or dropping the (?s) and changing the .* that has to span lines to [\S\s]*.
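To illustrate, a rough before/after using GNU grep with -z to allow multi-line matching; this is only an assumption about how the check is invoked, the exact call in openqa-label-known-issues may differ:
# Current regex: with (?s), every .* may span the whole log, which can backtrack heavily.
grep -qzP '(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out' autoinst-log.txt
# Possible rewrite: keep line-spanning wildcards only where lines really have to be crossed;
# the error message itself fits on one line, so [^\n]* is enough there.
grep -qzP 'tests/ha[\S\s]*(command[^\n]*timed out|Test died)[\S\s]*match=root-console timed out' autoinst-log.txt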
Updated by openqa_review almost 2 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: migration_offline_scc_verify_sle12sp4_ltss_ha_alpha_node02
https://openqa.suse.de/tests/10562975#step/migrate_clvmd_to_lvmlockd/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Expect the next reminder at the earliest in 80 days if nothing changes in this ticket.
Updated by openqa_review over 1 year ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: xfstests_xfs-xfs-reflink
https://openqa.suse.de/tests/11162974#step/run/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Expect the next reminder at the earliest in 160 days if nothing changes in this ticket.
Updated by acarvajal over 1 year ago
- Assignee changed from bschmidt to acarvajal
Updated by acarvajal over 1 year ago
- Category changed from Bugs in existing tests to Infrastructure
- Status changed from Feedback to Closed
Closing this until the issue is seen again, so that more recent jobs can be referenced.
Updated by openqa_review about 1 year ago
- Status changed from Closed to Feedback
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: ha_ctdb_node01
https://openqa.suse.de/tests/12857331#step/iscsi_client/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Expect the next reminder at the earliest in 40 days if nothing changes in this ticket.
Updated by openqa_review 7 months ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: migration_offline_scc_verify_sle15sp3_ha_alpha_node02
https://openqa.suse.de/tests/14373096#step/check_after_reboot/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Expect the next reminder at the earliest in 316 days if nothing changes in this ticket.