action #95458: [qe-sap][ha] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry - openQA Tests (public) - openSUSE Project Management Tool


action #95458

open

[qe-sap][ha] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry

Added by acarvajal almost 4 years ago. Updated about 1 month ago.

Status: Feedback
Priority: Normal
Assignee: acarvajal
Category: Infrastructure
Target version: QA (public) - future
Start date: 2021-07-13
Due date:
% Done:
Estimated time:
Difficulty:

Description

Observation

openQA tests in HA scenarios (2 or 3 node clusters) fail in different modules due to unexpected reboots
in one or more of the SUTs:

QAM HA qdevice node 1, fails in ha_cluster_init module
QAM HA rolling upgrade migration, node 2, fails in filesystem module
QAM HA hawk/HAProxy node 1, fails in check_after_reboot module
QAM 2 nodes, node 1, fails in ha_cluster_init module

Test suite description

The base test suite is used for job templates defined in YAML documents. It has no settings of its own.

Reproducible

The issue is very sporadic, and reproducing it is not always possible. Usually, re-triggering the jobs leads to the tests passing.

For example, from the jobs above, re-triggered jobs succeeded:

https://openqa.suse.de/tests/6435165
https://openqa.suse.de/tests/6435380
https://openqa.suse.de/tests/6435384
https://openqa.suse.de/tests/6435389

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label ,
call openqa-query-for-job-label poo#95458

Expected result

Last good: :20349:samba (or more recent)
Last good: :20121:crmsh
Last good: 20321:kernel-ec2
Last good: MR:244261:crmsh

Related issues 3 (1 open — 2 closed)

Related to openQA Tests (public) - action #95788: [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network (Feedback, 2021-07-21)
Related to openQA Tests (public) - action #94171: [qem][sap] test fails in check_logs about 50% of times (Rejected, 2021-06-17)
Related to openQA Tests (public) - action #97013: [qe-core][qe-yast] test fails in handle_reboot, patch_and_reboot, installation (Resolved, 2021-08-17)

History

Updated by maritawerner almost 4 years ago

Subject changed from SUT reboots unexpectedly, leading to tests failing in HA scenarios to [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios

Updated by szarate almost 4 years ago

Project changed from openQA Tests (public) to openQA Project (public)
Category changed from Bugs in existing tests to Regressions/Crashes

Another example can be: https://openqa.suse.de/tests/6410169#step/filesystem/28

After checking the logs, and referencing with the system journal the only rough hint I get is:

Jul 11 01:36:26 openqaworker5 kernel: kvm [38069]: vcpu0, guest rIP: 0xffffffff924776b8 disabled perfctr wrmsr: 0xc2 data 0xffff

which corresponds more or less to the last time the test ran one of those commands... I see no coredumps whatsoever... but that message is a bit puzzling (they repeat every now and then too inside the worker, but other jobs don't have the problem apparently)

PS: Moved to the openQA project for now, although I'm torn between infrastructure and this project itself

Updated by MDoucha almost 4 years ago

Project changed from openQA Project (public) to openQA Tests (public)
Category deleted (~~Regressions/Crashes~~)

My first guess would be that the test somehow gets into a kernel panic. Add the ignore_level kernel command line parameter to grub.cfg during incident installation to see kernel backtraces in serial console logs. Here's an example how LTP adds the ignore_level kernel parameter:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/kernel/install_ltp.pm#L353

Updated by MDoucha almost 4 years ago

Project changed from openQA Tests (public) to openQA Project (public)
Category set to Regressions/Crashes

Oops, sorry for overwriting some metadata.

Updated by okurz almost 4 years ago

Category changed from Regressions/Crashes to Support
Status changed from New to Feedback
Assignee set to okurz
Target version set to Ready

Hm, if there were a kernel panic then the serial log should show at least something. But the system acts as if it went through a forced power reset. https://openqa.suse.de/tests/6410169/logfile?filename=serial0.txt mentions "Wf8kQ-0-" as the last token before the next command is executed, but there is nothing after that token in the serial log.

I also manually checked the video from https://openqa.suse.de/tests/6410169/file/video.ogv and stepped through the frames one by one and have not found anything between the healthy bash session like in https://openqa.suse.de/tests/6410169#step/filesystem/28 and the grub menu on boot like https://openqa.suse.de/tests/6410169#step/filesystem/29

According to https://bugs.centos.org/view.php?id=6730 and https://bugzilla.redhat.com/show_bug.cgi?id=507085 messages like "kvm: vcpu0, guest rIP disabled perfctr wrmsr" are considered harmless. I doubt they are related to the problems we see.

@acarvajal I suggest you try the kernel command line parameters that mdoucha suggested. Also you could try if QEMUCPU=host makes any difference, just a hunch :)

EDIT: If you would like others to pick this up, I suggest you try to come up with "steps to reproduce", e.g. an openqa-cli api -X post isos command line to trigger a safe set of jobs that do not interfere with production for crosschecking. Then we could potentially also ask someone else from the tools team to take over.

Updated by okurz almost 4 years ago

Due date set to 2021-07-30

Updated by acarvajal almost 4 years ago

okurz wrote:

@acarvajal I suggest you try the kernel command line parameters that mdoucha suggested. Also you could try if QEMUCPU=host makes any difference, just a hunch :)

I have added QEMUCPU=host in the tests that run in the HANA validation openQA instance. Last time there were 3 failures with this issue out of 200+ jobs, so I guess it is a good place to try. I should see some results in the next runs on Monday.

BTW, interesting hunch. I am not seeing this issue in Power9 (100+ jobs in that same openQA instance), which made me think that whatever's causing it could be related to qemu. I'll come back with some results next Monday.

I will also begin planning to introduce mdoucha's suggestions to gather more logs for the tests in osd.

Updated by okurz almost 4 years ago

acarvajal wrote:

okurz wrote:

@acarvajal I suggest you try the kernel command line parameters that mdoucha suggested. Also you could try if QEMUCPU=host makes any difference, just a hunch :)

I have added QEMUCPU=host in the tests that run in the HANA validation openQA instance. Last time there were 3 failures with this issue out of 200+ jobs, so I guess it is a good place to try. I should see some results in the next runs on Monday.

What I thought of is to not change the production tests but rather trigger an additional, dedicated test set, e.g. following https://progress.opensuse.org/projects/openqatests/wiki/Wiki#Statistical-investigation and after the weekend look at the corresponding test overview page to quickly get the overview of how many tests passed/failed.

Updated by acarvajal almost 4 years ago

I had to restart 2 tests with the failure yesterday in the HANA validation openQA instance.

So the difference between using QEMUCPU=host or not was 3 out of 200+ last week to 2 out of 200+ this week. I don't think this is statistically relevant, and the bad news is that the issue is still present.

I would look into implementing mdoucha's suggestions and triggering some additional jobs with and without QEMUCPU=host to do a more thorough analysis.
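To put a number on "not statistically relevant", the two weeks can be compared with a Fisher exact test. The sketch below assumes exactly 200 jobs per week (the comment only says "200+") and uses only the Python standard library; the function name is made up for illustration.

```python
from math import comb


def fisher_exact_two_sided(fail_a, total_a, fail_b, total_b):
    """Two-sided Fisher exact p-value for two failure counts.

    Sums the probability of every table with the same margins that is
    at most as likely as the observed one.
    """
    failures = fail_a + fail_b
    jobs = total_a + total_b

    def weight(k):
        # Unnormalised hypergeometric weight of k failures landing in group A.
        return comb(failures, k) * comb(jobs - failures, total_a - k)

    lowest = max(0, failures - total_b)
    highest = min(failures, total_a)
    observed = weight(fail_a)
    extreme = sum(
        weight(k) for k in range(lowest, highest + 1) if weight(k) <= observed
    )
    return extreme / comb(jobs, total_a)


# Assumed sample sizes: 200 jobs per week,
# 3 failures without QEMUCPU=host and 2 failures with it.
p_value = fisher_exact_two_sided(3, 200, 2, 200)
print(f"p = {p_value:.3f}")
```

With these assumed numbers the p-value comes out far above the usual 0.05 threshold, which is consistent with the reading that QEMUCPU=host made no measurable difference. A larger dedicated run with and without the setting would be needed to detect a small effect.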

#10

Updated by okurz almost 4 years ago

Subject changed from [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios to [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha/filesystem.pm:81.*command.*mkfs -t ocfs2.*timed out":retry

I have seen the symptom

[2021-07-11T01:38:13.388 CEST] [info] ::: basetest::runtest: # Test died: command 'mkfs -t ocfs2 -F -N 16 "/dev/vg_cluster_md/lv_openqa"' timed out at /usr/lib/os-autoinst/testapi.pm line 959.

likely more than once – or I keep getting back to the same job ;) – still, I am trying with auto-review to automatically detect such cases and retrigger according tests. The regex should be specific enough to only catch HAS tests but at any time it can be generalized or extended to cover alternative symptoms. quoting @acarvajal "hope you can find an easy way to detect those cases .... it's not easy as it can happen anytime and in any module during the test" but we should start somewhere :)

#11

Updated by acarvajal almost 4 years ago

okurz wrote:

I have seen the symptom
[2021-07-11T01:38:13.388 CEST] [info] ::: basetest::runtest: # Test died: command 'mkfs -t ocfs2 -F -N 16 "/dev/vg_cluster_md/lv_openqa"' timed out at /usr/lib/os-autoinst/testapi.pm line 959.
likely more than once – or I keep getting back to the same job ;) – still, I am trying with auto-review to automatically detect such cases and retrigger according tests. The regex should be specific enough to only catch HAS tests but at any time it can be generalized or extended to cover alternative symptoms. quoting @acarvajal "hope you can find an easy way to detect those cases .... it's not easy as it can happen anytime and in any module during the test" but we should start somewhere :)

EDIT: Tested with echo https://openqa.suse.de/tests/6410236 | env host=openqa.suse.de openqa-investigate with openqa-investigate from github.com/os-autoinst/scripts/ . But openqa-investigate currently does not support cloning jobs that are part of a multi-machine cluster, see https://github.com/os-autoinst/scripts/commit/371467dafcefb9182530c790c33632f8cfa9a297#diff-f73cf39a07f6cf8cdb453862496919d06df16d07e58b274e68ea148dd1f7dae5

That's one of the symptoms.

I'd say whenever SUT is unexpectedly rebooted, tests will fail in one of two ways depending on what the test module was doing:

A timeout in an assert_script_run (or similar), such as the symptom in this filesystem test module failure.
A failure in assert_screen such as in https://openqa.suse.de/tests/6426340#step/check_after_reboot/5

Since the majority of the HA test modules rely either on the root_console or the serial terminal, I think the first case will be more common, but I don't know if having a general rule to re-start tests when commands time out is safe.

#12

Updated by okurz almost 4 years ago

Subject changed from [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha/filesystem.pm:81.*command.*mkfs -t ocfs2.*timed out":retry to [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*command.*mkfs -t ocfs2.*timed out.*match=root-console timed out":retry

#13

Updated by okurz almost 4 years ago

Tested the auto-review regex with

$ echo https://openqa.suse.de/tests/6410225 | env dry_run=1 host=openqa.suse.de ./openqa-label-known-issues
openqa-cli api --host https://openqa.suse.de -X POST jobs/6410225/comments text=poo#95458 [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*command.*mkfs -t ocfs2.*timed out.*match=root-console timed out":retry
openqa-cli api --host https://openqa.suse.de -X POST jobs/6410225/restart

so a comment would have been written and the test should have been restarted, assuming this works this way over the API for multi-machine clusters.

#14

Updated by okurz almost 4 years ago

Subject changed from [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*command.*mkfs -t ocfs2.*timed out.*match=root-console timed out":retry to [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry

#15

Updated by okurz almost 4 years ago

Related to action #95788: [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network added

#16

Updated by okurz almost 4 years ago

Description updated (diff)

#17

Updated by okurz almost 4 years ago

Project changed from openQA Project (public) to openQA Tests (public)
Subject changed from [HA] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry to [ha] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry
Category changed from Support to Bugs in existing tests
Status changed from Feedback to Workable
Assignee changed from okurz to acarvajal

the auto-review regex matching might be a bit too broad as it also catches issues like https://openqa.suse.de/tests/6581119#step/iscsi_client/18 where the test fails in iscsi and then also the post_fail_hook fails to select a free root terminal. However this is all within the scope of tests/ha so I will leave this to you again. As followup to #95458#note-2 , sorry, I don't see how this is a problem with openQA itself.
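The breadth concern can be illustrated offline. The sketch below applies the earlier, narrower pattern and the current one to two hypothetical log excerpts. The excerpts are invented for illustration (only the mkfs command is taken from this ticket), and Python's re module stands in for whatever regex engine auto-review actually uses, so this is an approximation rather than a reproduction of the real matching.

```python
import re

# Patterns copied from the subject history of this ticket.
NARROW = r'(?s)tests/ha.*command.*mkfs -t ocfs2.*timed out.*match=root-console timed out'
BROAD = r'(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out'

# Hypothetical excerpt: command times out after an unexpected reboot.
REBOOT_LOG = """\
tests/ha/filesystem.pm:81 called testapi::assert_script_run
Test died: command 'mkfs -t ocfs2 -F -N 16 "/dev/vg_cluster_md/lv_openqa"' timed out
match=root-console timed out
"""

# Hypothetical excerpt: unrelated iscsi failure, then the post_fail_hook
# fails to select a root console.
ISCSI_LOG = """\
tests/ha/iscsi_client.pm called testapi::assert_screen
Test died: no candidate needle matched
match=root-console timed out
"""


def matches(pattern, log):
    """Return True if the pattern is found anywhere in the log text."""
    return re.search(pattern, log) is not None


for name, log in (("reboot", REBOOT_LOG), ("iscsi", ISCSI_LOG)):
    print(f"{name}: narrow={matches(NARROW, log)} broad={matches(BROAD, log)}")
```

In this toy setup the broad pattern also flags the iscsi excerpt, because any "Test died" inside tests/ha followed by a root-console timeout satisfies it. Tightening the alternation would reduce such false positives at the cost of missing reboots that surface in other modules.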

#18

Updated by MDoucha almost 4 years ago

MDoucha wrote:

My first guess would be that the test somehow gets into a kernel panic. Add the ignore_level kernel command line parameter to grub.cfg during incident installation to see kernel backtraces in serial console logs. Here's an example how LTP adds the ignore_level kernel parameter:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/kernel/install_ltp.pm#L353

Correction of my suggestion: the kernel command line parameter is actually ignore_loglevel. Also updating the link above to permalink:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/49342da7528b6bc0a8b418090487bc40c7f8e4ce/tests/kernel/install_ltp.pm#L359

#19

Updated by okurz almost 4 years ago

Due date changed from 2021-07-30 to 2021-08-06
Target version changed from Ready to future

@acarvajal are you ok to continue here?

#20

Updated by okurz almost 4 years ago

Related to action #94171: [qem][sap] test fails in check_logs about 50% of times added

#21

Updated by acarvajal almost 4 years ago

Project changed from openQA Tests (public) to openQA Project (public)
Due date changed from 2021-08-06 to 2021-09-10
Category deleted (~~Bugs in existing tests~~)
Status changed from Workable to Feedback
Assignee changed from acarvajal to okurz
Target version changed from future to Ready

okurz wrote:

@acarvajal are you ok to continue here?

Yes. I think so. I will probably sync with you before doing so though.

#22

Updated by acarvajal almost 4 years ago

Project changed from openQA Project (public) to openQA Tests (public)
Category set to Bugs in existing tests
Status changed from Feedback to Workable
Assignee changed from okurz to acarvajal
Target version changed from Ready to future

#23

Updated by szarate almost 4 years ago

Related to action #97013: [qe-core][qe-yast] test fails in handle_reboot, patch_and_reboot, installation added

#24

Updated by openqa_review almost 4 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: migration_offline_scc_verify_sle15sp1_ltss_ha_alpha_node02
https://openqa.suse.de/tests/6958789

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The label in the openQA scenario is removed

#25

Updated by okurz over 3 years ago

Subject changed from [ha] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry to [qe-sap][ha] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry

Using keyword "qe-sap" as verified by jmichel in weekly QE sync 2021-09-15

#26

Updated by okurz over 3 years ago

this ticket is exceeding its due-date. It popped up during the weekly QE sync 2021-09-22. We would appreciate a reaction within the next days, at least updating the due-date according to what we can realistically expect. See https://progress.opensuse.org/projects/openqatests/wiki#SLOs-service-level-objectives for details

#27

Updated by acarvajal over 3 years ago

Due date changed from 2021-09-10 to 2021-10-15

#28

Updated by openqa_review over 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_qdevice_node2
https://openqa.suse.de/tests/7350101

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

#29

Updated by openqa_review over 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_qdevice_node2
https://openqa.suse.de/tests/7393110

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

#30

Updated by okurz over 3 years ago

so we discussed in a meeting. Thank you for inviting me :)

As this is a ticket about the tests themselves I suggest that you within qe-sap drive the work. And from the tools team we will be able to provide support and stand ready for collaboration. If you don't plan to have the issue fixed soon the according tests can also be unscheduled according to the QAM processes until (hopefully) eventually the tests can be brought back. One additional observation from my side: Some months ago we still had highly stable multi-machine tests for the areas "network" as well as "HPC" and "SAP" but it seems all three areas have not received a lot of love. There are multiple other areas where multi-machine tests are running just fine so I am not aware of any generic openQA problems. What I see are limited to aforementioned areas hence it's unlikely that the issues will go away until explicitly addressed because they are domain-specific. Based on my former experiences those issues could very well point to valid product regressions.
And https://openqa.suse.de/tests?match=wicked_ shows that openQA multi-machine tests relying on the network can work very reliably. No network related problem showing up there. A quick SQL query revealed that for the fail ratio of "wicked" tests within the last 10 days is 0.6% so very low. This shows that openQA multi-machine tests can be very stable and also shows that we don't have a generic problem in our tooling or infrastructure. Regardless if it's sporadic product issues or test design flaws at this point I recommend to focus on mitigating the negative impact on the openQA review procedures in short-term and remove the according tests from the schedule. Also see https://confluence.suse.com/display/openqa/QAM+openQA+review+guide as reference. During the time the tests are not within the validation schedule QE-SAP would need to ensure by other means that the product quality is sufficient.

@acarvajal does it make sense for you to stay assigned to this ticket?

@rbranco as direct followup to what we just discussed, have you seen #95458#note-29 ? I suggest to remove ha_qdevice_node2 and all related test suites from the validation job groups until the problem can be fixed. Can you create an according merge request for the test schedule in https://gitlab.suse.de/qa-maintenance/qam-openqa-yml ?

#31

Updated by acarvajal over 3 years ago

Assignee changed from acarvajal to rbranco

okurz wrote:

And https://openqa.suse.de/tests?match=wicked_ shows that openQA multi-machine tests relying on the network can work very reliably. No network related problem showing up there.

This issue is definitely not network-related. 100% agreement with you there. The ticket that was tracking network issues in HA QAM tests is: https://progress.opensuse.org/issues/95788

This ticket tracks instead the sporadic failure where the SUT VM reboots when test code is not expecting it, and even if this is more frequent in MM scenarios, it can also happen on SM scenarios.

And this is not limited only to osd. In openqa.wdf.sap.corp I've noticed the same issue in:

9 jobs out of 384 jobs total for this week.
4 jobs out of 384 jobs total for the week starting on 18.10.2021.
4 jobs out of 384 jobs total for the week starting on 11.10.2021.
12 jobs out of 384 jobs total for the week starting on 4.10.2021

All these jobs are running with QEMUCPU=host since it was suggested in https://progress.opensuse.org/issues/95458#note-5, but even without that setting, failure rate was more or less the same. It is a low failure rate though, at around 1.8%.
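The weekly counts above can be pooled to check the quoted rate. The sketch below is plain arithmetic on the numbers listed in this comment; the variable names are illustrative.

```python
# Weekly (failed, total) job counts as reported above for openqa.wdf.sap.corp.
weekly = [(9, 384), (4, 384), (4, 384), (12, 384)]

failed = sum(f for f, _ in weekly)
total = sum(t for _, t in weekly)
overall_rate = 100 * failed / total

for f, t in weekly:
    print(f"{f}/{t} = {100 * f / t:.1f}%")
print(f"overall: {failed}/{total} = {overall_rate:.2f}%")
```

The pooled figure lands close to the "around 1.8%" mentioned above, while the individual weeks vary noticeably around it, as expected for a rare, sporadic failure.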

Restarting all these tests results in them passing. Hence no other assumption to make except that a transient race condition is to blame for these failures, and not something related to the tests themselves.

A quick SQL query revealed that for the fail ratio of "wicked" tests within the last 10 days is 0.6% so very low.

What would be the fail ratio of "qdevice" and "qnetd" in the same period? Can you point me to where/how I can get that data myself?

@acarvajal does it make sense for you to stay assigned to this ticket?

I do not think so. I asked @jmichel who should I assign it to, so I am assigning this to @rbranco.

@rbranco as direct followup to what we just discussed, have you seen #95458#note-29 ? I suggest to remove ha_qdevice_node2 and all related test suites from the validation job groups until the problem can be fixed. Can you create an according merge request for the test schedule in https://gitlab.suse.de/qa-maintenance/qam-openqa-yml ?

Following up on the "wicked" exercise from above, if we go to: https://openqa.suse.de/tests?match=ha_qdevice

We'll see that out of the last 500 qdevice jobs finished in osd, there are at the time of this writing:

343 passed, 109 softfailed, 12 failed, 6 skipped, 3 incomplete and 27 parallel failed

All 3 incompletes are failures while downloading the qcow2 image. Checking the Next & Previous tab in all these 3 tests shows that these have passed earlier or later, so the only explanation that I have for the missing qcow2 image failure is that the qdevice test started after osd had already cleaned the asset from the storage.

The 12 failures amount to a 2.4% failure rate. This percentage increases to 7.8% if removing the skipped tests and adding the parallel failed ones.
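The percentages above can be re-derived from the listed result counts. The sketch below is only arithmetic on those numbers, with illustrative variable names. Note that the quoted 7.8% corresponds to dividing by all 500 jobs; leaving the 6 skipped jobs out of the denominator gives a slightly higher value.

```python
# Result counts for the last 500 ha_qdevice jobs, as listed above.
results = {
    "passed": 343,
    "softfailed": 109,
    "failed": 12,
    "skipped": 6,
    "incomplete": 3,
    "parallel_failed": 27,
}

total = sum(results.values())
failed_rate = 100 * results["failed"] / total

# Count parallel failed jobs as failures too.
failed_or_parallel = results["failed"] + results["parallel_failed"]
rate_all_jobs = 100 * failed_or_parallel / total
rate_without_skipped = 100 * failed_or_parallel / (total - results["skipped"])

print(f"total jobs: {total}")
print(f"failed only: {failed_rate:.1f}%")
print(f"failed + parallel failed, all jobs: {rate_all_jobs:.1f}%")
print(f"failed + parallel failed, skipped removed: {rate_without_skipped:.1f}%")
```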

However, if qdevice is a 2 node test, why are there so many more parallel failed jobs than failed jobs? The explanation lies in the failed qnetd jobs that run in parallel to the qdevice nodes. Checking https://openqa.suse.de/tests?match=ha_qnetd shows 6 failures within the last 500 jobs, so that accounts for an extra 12 parallel failed jobs.

But then when we get to why these tests failed, we see that:

Of the 6 qnetd failures, 5 were due to issues in the screen: https://openqa.suse.de/tests/7507523, https://openqa.suse.de/tests/7446239, https://openqa.suse.de/tests/7443590, https://openqa.suse.de/tests/7439788, https://openqa.suse.de/tests/7387987.
The other one, failed connecting to IBS repositories: https://openqa.suse.de/tests/7454688#step/qnetd/28
Of the 12 qdevice failures, 3 were also due to issues in the screen: https://openqa.suse.de/tests/7444348, https://openqa.suse.de/tests/7448601, https://openqa.suse.de/tests/7450384
4 seem to be due to a product bug ... a missing binary in the image: https://openqa.suse.de/tests/7519115#step/iscsi_client/13, https://openqa.suse.de/tests/7518128#step/iscsi_client/13, https://openqa.suse.de/tests/7511904#step/iscsi_client/13, https://openqa.suse.de/tests/7511001#step/iscsi_client/13
2 due to connectivity issues with the support server network: https://openqa.suse.de/tests/7478390#step/qnetd/22, https://openqa.suse.de/tests/7460365#step/ha_cluster_join/11
1 due to a slow worker: https://openqa.suse.de/tests/7456662#step/ha_cluster_join/13
The other 2 due to an HA dependency not starting: https://openqa.suse.de/tests/7509152#step/check_after_reboot/15, https://openqa.suse.de/tests/7451349#step/ha_cluster_init/17

So, out of 1000 jobs qdevice/qnetd tests:

18 failures.
8 of those due to screen rendering --> Not related to the tests themselves.
4 possibly due to a product bug
1 due to worker performance --> Not related to the test itself.
2 due to support server network connectivity issues --> Not related to the test themselves.
1 due to failed connection to IBS --> Could be due to a test setting, QAM bot test scheduling, or IBS repository not present.
2 due to HA dependencies not starting --> Could be product issue. Could be worker performance.
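The breakdown can be cross-checked by tallying the causes. The sketch below uses the counts from the list above; the grouping into "not test related" follows the annotations in that list, and the dictionary keys are shortened labels chosen for illustration.

```python
# Failure causes across the 1000 qdevice/qnetd jobs, as listed above.
causes = {
    "screen rendering": 8,
    "possible product bug": 4,
    "worker performance": 1,
    "support server network": 2,
    "IBS connection": 1,
    "HA dependency not starting": 2,
}
# Causes annotated above as not related to the tests themselves.
not_test_related = ("screen rendering", "worker performance", "support server network")

jobs = 1000
total_failures = sum(causes.values())
unrelated = sum(causes[name] for name in not_test_related)

print(f"failures: {total_failures} of {jobs} jobs ({100 * total_failures / jobs:.1f}%)")
print(f"annotated as not test related: {unrelated} of {total_failures}")
for name, count in sorted(causes.items(), key=lambda item: -item[1]):
    print(f"  {count:2d}  {name}")
```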

Not a single failure related to this ticket. And more than that, not enough grounds for removing these tests from the schedule.

No idea why the focus on the qdevice/qnetd tests when this issue was opened for several different scenarios.

I also insist that decreasing test coverage should not be the approach, but I will refrain from beating this dead horse.

#32

Updated by okurz over 3 years ago

acarvajal wrote:

[…]

I appreciate your thorough analysis and all of it up to this point is completely correctly evaluated. But still we should react on the problem at hand: QE-SAP and/or HA tests sometimes "randomly fail", e.g. due to the reported "spontaneous reboot issues". The consequence is that qa-maintenance/bot-ng will not auto-approve according SLE maintenance updates and the group of openQA maintenance test reviewers that should merely "coordinate" are asked for help to move the according SLE maintenance updates forward. 1. These test reviewers should not even need to do that job because the team qe-sap should do it (see https://confluence.suse.com/display/qasle/openQA+QE+Maintenance+Review ) and 2. The openQA maintenance test reviewers do not have the capacity to fix all the different test instabilities themselves. So based on which individual issues you identified let me provide some specific questions or suggestions:

So, out of 1000 jobs qdevice/qnetd tests:

18 failures.

8 of those due to screen rendering --> Not related to the tests themselves.

even if that is the case. Is there a ticket to improve that situation?

4 possibly due to a product bug

So where is the follow-up for that? Just the job that was mentioned by openqa-review https://openqa.suse.de/tests/7393110#step/check_after_reboot/15 shows that a service fails to start up. If all cases were like these then we should be good. But this only works if everybody sees the test failures the same way.

1 due to worker performance --> Not related to the test itself.

ok, but which ticket covers this then? Commonly people say "worker performance issue" when tests apply too strict timeouts which need to be changed on the test level. If you find issues where you really think that it's such a bad performance that test changes won't help then please let us know about the specific cases so that we can look into them.

2 due to support server network connectivity issues --> Not related to the test themselves.

but which ticket then?

1 due to failed connection to IBS --> Could be due to a test setting, QAM bot test scheduling, or IBS repository not present.

in that case tests can still be improved with better retrying

[…]

Not a single failure related to this ticket. And more than that, not enough grounds for removing these tests from the schedule.

No idea why the focus on the qdevice/qnetd tests when this issue was opened for several different scenarios.

I came back to this ticket for two reasons: Because there is https://progress.opensuse.org/issues/95458#note-29 pointing to a test failing due to the "test issue" described in this ticket (even though this might not be true but the test is labeled like that). And second because the due-date was exceeded by more than ten days so violating https://progress.opensuse.org/projects/openqatests/wiki#SLOs-service-level-objectives hence I noticed it.

I also insist that decreasing test coverage should not be the approach, but I will refrain from beating this dead horse.

Well, removing the test (temporarily) until fixed should only be seen as last resort. And to be honest: I repeatedly mention that option as the way I would go to spawn some motivation to fix it ;)

#33

Updated by acarvajal over 3 years ago

okurz wrote:

These test reviewers should not even need to do that job because the team qe-sap should do it (see https://confluence.suse.com/display/qasle/openQA+QE+Maintenance+Review ) and 2. The openQA maintenance test reviewers do not have the capacity to fix all the different test instabilities themselves.

No disagreements there. What I disagree on - and what I have repeated more times than I care to count - is that the solution should never be dropping coverage. Even if as you say "randomly failing" tests impact automatic reviews so much, it's a practice that leads to a false sense of security. Even when the practice is followed on, it can lead to unacceptable drops in coverage (for example, see https://gitlab.suse.de/qa-maintenance/qam-openqa-yml/-/merge_requests/150 & https://progress.opensuse.org/issues/68932).

So based on which individual issues you identified let me provide some specific questions or suggestions:

OK. Feels like you're dodging my argument which was "why remove from the schedule a test that has a success rate of over 90%?", but let's go ahead.

So, out of 1000 jobs qdevice/qnetd tests:

18 failures.

8 of those due to screen rendering --> Not related to the tests themselves.

even if that is the case. Is there a ticket to improve that situation?

No idea. Is it?

4 possibly due to a product bug

so where is the follow-up for that? Just the job that was mentioned by openqa-review https://openqa.suse.de/tests/7393110#step/check_after_reboot/15 shows that a service fails to start up. If all cases would be like these then we should be good. But this only works if everybody sees the test failures the same way.

Huh? I did mention https://openqa.suse.de/tests/7393110#step/check_after_reboot/15 below.

Is there a follow up to the 4 tests failing due to a product bug? Cannot say. First time I saw these failures was yesterday, and from what I could see, later tests on those scenarios passed.

Should I open a bug for an issue that's already fixed?

1 due to worker performance --> Not related to the test itself.

ok, but which ticket covers this then? Commonly people say "worker performance issue" when tests apply too strict timeouts which need to be changed on the test level. If you find issues where you really think that it's such a bad performance that test changes won't help then please let us know about the specific cases so that we can look into them.

"Test apply too strict timeouts"???? Give me a break!

If the response to any issue presented is always going to be "perhaps you have a wrong setting", "perhaps there is a problem in the test code" or "perhaps you were too aggressive with a timeout", then we will continue to ignore potential issues.

In this particular case:

https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/lib/hacluster.pm#L91
"TIMEOUT_SCALE" : 2, from https://openqa.suse.de/tests/7456662/file/vars.json

Command waited for 2 whole minutes before failing. Is 2 minutes too aggressive? I do not think so.
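The arithmetic behind that figure can be sketched as follows. This is only an illustration: the 60-second base value is an assumption inferred from the numbers above (2 minutes with a TIMEOUT_SCALE of 2), and `scaled_timeout` is a hypothetical helper, not the function used in hacluster.pm.

```python
def scaled_timeout(base_seconds: int, timeout_scale: int = 1) -> int:
    """Return the effective timeout after applying the scale factor."""
    return base_seconds * timeout_scale


# Assumed base of 60 s; the scale of 2 is the TIMEOUT_SCALE from vars.json.
effective = scaled_timeout(60, timeout_scale=2)
print(f"{effective} s = {effective / 60:.0f} min")
```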

I grant that I may have misspoken when claiming "slow worker" though. For that command, the issue could also be network related. Sadly there is no way to tell from the failure, and of course later runs of the same test in the same scenario (several, in fact) show passing results: https://openqa.suse.de/tests/7549015

2 due to support server network connectivity issues --> Not related to the tests themselves.

but which ticket then?

https://progress.opensuse.org/issues/95788

1 due to failed connection to IBS --> Could be due to a test setting, QAM bot test scheduling, or IBS repository not present.

in that case tests can still be improved with better retrying

Agree.
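As a rough idea of what "better retrying" could look like, here is a minimal sketch of a retry loop with increasing delays. It is written in Python purely for illustration; `retry` and `flaky_fetch` are hypothetical names, and nothing here is taken from the actual test code.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(action: Callable[[], T], attempts: int = 3, delay: float = 5.0,
          backoff: float = 2.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call action() until it succeeds or the attempts are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of attempts: surface the real error
            sleep(delay)
            delay *= backoff  # wait longer before the next try
    raise AssertionError("unreachable")


# Hypothetical stand-in for a repository download that fails twice.
calls = {"n": 0}


def flaky_fetch() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("repository not reachable")
    return "ok"


# The sleep function is replaced so the demo does not actually wait.
result = retry(flaky_fetch, attempts=3, sleep=lambda _seconds: None)
print(result, "after", calls["n"], "calls")
```

Only connection errors are retried here, so a genuine failure of another kind would still surface immediately.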

Well, removing the test (temporarily) until fixed should only be seen as last resort. And to be honest: I repeatedly mention that option as the way I would go to spawn some motivation to fix it ;)

Understood. Hope you succeed and manage to mobilize all the teams required to get to a solution. From my experience, even if QE-SAP are the experts on these scenarios, these random failures usually fall outside of QE-SAP field of expertise, so I do agree and believe that a coordinated effort is required.

#34

Updated by rbranco over 3 years ago

Due date changed from 2021-10-15 to 2022-01-31

#35

Updated by openqa_review over 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_beta_node02
https://openqa.suse.de/tests/7671905

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

#36

Updated by openqa_review over 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_hawk_haproxy_node01
https://openqa.suse.de/tests/7749089

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

#37

Updated by openqa_review over 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_ctdb_node01
https://openqa.suse.de/tests/7829513

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

#38

Updated by openqa_review over 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_ctdb_node02
https://openqa.suse.de/tests/7871807

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

#39

Updated by openqa_review over 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_diskless_sbd_qdevice_node1
https://openqa.suse.de/tests/7976342

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

#40

Updated by jkohoutek over 3 years ago

Today I looked at this again while solving issues with this: https://openqa.suse.de/tests/8012986#step/check_logs/2

From my observation, ALL jobs whose running time reaches 2h fail, while the faster ones, around 1h, succeed. Between those it's random, but they also usually succeed: https://openqa.suse.de/tests/8005979#next_previous

The question is why the same update once took almost 2 hours and failed:

 check_logs 	:22413:saptune	3 days ago ( 01:55 hours )

https://openqa.suse.de/tests/7993472

but a day later it took just 1 hour and succeeded:

  :22413:saptune	2 days ago ( 01:10 hours )

https://openqa.suse.de/tests/8005979#:~:text=%3A22413%3Asaptune,days%20ago%20(%2001%3A10%20hours%20)

#41

Updated by rbranco over 3 years ago

Due date changed from 2022-01-31 to 2022-04-30

#42

Updated by openqa_review over 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_ha_rolling_upgrade_migration_node01
https://openqa.suse.de/tests/8192134

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

#43

Updated by rbranco about 3 years ago

Status changed from Workable to Resolved

No longer seeing this issue with:

https://gitlab.suse.de/qa-maintenance/qam-openqa-yml/-/merge_requests/231
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14457

#44

Updated by okurz about 3 years ago

I can confirm. So far this looks good. openqa-query-for-job-label poo#95458 shows

2201470|2022-02-22 04:16:29|done|failed|container-host-microos||openqaworker7
8318522|2022-03-13 03:51:09|done|failed|qam_ha_rolling_update_node01||openqaworker9
8306778|2022-03-10 14:14:56|done|failed|ha_beta_node02||malbec
8296656|2022-03-09 17:30:46|done|failed|ha_beta_node02||QA-Power8-5-kvm
8297435|2022-03-09 12:45:09|done|failed|migration_offline_dvd_verify_sle15sp1_ltss_ha_alpha_node01||openqaworker3
8297914|2022-03-09 10:33:55|done|failed|ha_hawk_haproxy_node02||openqaworker6
8297846|2022-03-09 10:07:52|done|failed|ha_alpha_node01||QA-Power8-5-kvm
8292351|2022-03-09 05:02:30|done|failed|ha_qdevice_node2||QA-Power8-5-kvm
8293695|2022-03-09 04:59:57|done|failed|ha_qdevice_node2||openqaworker-arm-1
8293662|2022-03-09 04:38:34|done|failed|ha_ctdb_node02||openqaworker-arm-2
8293690|2022-03-09 03:52:44|done|failed|ha_priority_fencing_node01||openqaworker-arm-3

so the latest failures were on 2022-03-09
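To see how the rows above spread over time, they can be tallied per day. The sketch below is only an illustration: the field meanings (job id, finish time, state, result, test name, an empty column, worker) are my reading of the rows, since the output carries no header, and `failures_per_day` is a hypothetical helper.

```python
from collections import Counter

# Rows copied verbatim from the openqa-query-for-job-label output above.
ROWS = """\
2201470|2022-02-22 04:16:29|done|failed|container-host-microos||openqaworker7
8318522|2022-03-13 03:51:09|done|failed|qam_ha_rolling_update_node01||openqaworker9
8306778|2022-03-10 14:14:56|done|failed|ha_beta_node02||malbec
8296656|2022-03-09 17:30:46|done|failed|ha_beta_node02||QA-Power8-5-kvm
8297435|2022-03-09 12:45:09|done|failed|migration_offline_dvd_verify_sle15sp1_ltss_ha_alpha_node01||openqaworker3
8297914|2022-03-09 10:33:55|done|failed|ha_hawk_haproxy_node02||openqaworker6
8297846|2022-03-09 10:07:52|done|failed|ha_alpha_node01||QA-Power8-5-kvm
8292351|2022-03-09 05:02:30|done|failed|ha_qdevice_node2||QA-Power8-5-kvm
8293695|2022-03-09 04:59:57|done|failed|ha_qdevice_node2||openqaworker-arm-1
8293662|2022-03-09 04:38:34|done|failed|ha_ctdb_node02||openqaworker-arm-2
8293690|2022-03-09 03:52:44|done|failed|ha_priority_fencing_node01||openqaworker-arm-3
"""


def failures_per_day(rows: str) -> Counter:
    """Count failed jobs per calendar day (assumed column layout)."""
    counts: Counter = Counter()
    for line in rows.strip().splitlines():
        _job_id, finished, _state, result, _test, _empty, _worker = line.split("|")
        if result == "failed":
            counts[finished.split()[0]] += 1  # keep only the date part
    return counts


per_day = failures_per_day(ROWS)
for day, count in sorted(per_day.items()):
    print(day, count)
```

For these rows the tally puts 8 of the 11 entries on 2022-03-09.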

#45

Updated by openqa_review about 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_priority_fencing_node01
https://openqa.suse.de/tests/8439544#step/iscsi_client/1

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

#46

Updated by openqa_review about 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_delta_node02
https://openqa.suse.de/tests/8741156#step/iscsi_client/1

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 60 days if nothing changes in this ticket.

#47

Updated by bschmidt over 2 years ago

Status changed from Resolved to In Progress

Unfortunately this is happening again :-(
see https://openqa.suse.de/tests/9679509#step/check_after_reboot/30

#48

Updated by slo-gin over 2 years ago

This ticket is 10 days after the due-date. Please consider closing this ticket or move the due-date accordingly.

#49

Updated by rbranco over 2 years ago

I vote for closing this ticket as the issue has nothing to do with SAP/HA.

#50

Updated by slo-gin over 2 years ago

This ticket is 10 days after the due-date. Please consider closing this ticket or move the due-date accordingly.

#51

Updated by acarvajal over 2 years ago

rbranco wrote:

I vote for closing this ticket as the issue has nothing to do with SAP/HA.

Is the issue gone? Judging by https://progress.opensuse.org/issues/95458#note-47 it isn't.

Before closing I would vote for re-assignment.

#52

Updated by slo-gin over 2 years ago

This ticket is 10 days after the due-date. Please consider closing this ticket or move the due-date accordingly.

#53

Updated by rbranco over 2 years ago

acarvajal wrote:

rbranco wrote:

I vote for closing this ticket as the issue has nothing to do with SAP/HA.

Is the issue gone? Judging by https://progress.opensuse.org/issues/95458#note-47 it isn't.

Before closing I would vote for re-assignment.

This poo is too generic IMHO. Can you please reassign to someone? I will be in squad rotation in November until February.

#54

Updated by slo-gin over 2 years ago

This ticket is 10 days after the due-date. Please consider closing this ticket or move the due-date accordingly.

#55

Updated by okurz over 2 years ago

Due date deleted (~~2022-04-30~~)

This ticket had a due date set but has already exceeded it by more than 14 days. We would like to take the due date seriously, so please update the ticket accordingly (resolve the ticket, update the due date or remove the due date). See https://progress.opensuse.org/projects/openqatests/wiki/Wiki#SLOs-service-level-objectives for details.

#56

Updated by openqa_review over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: migration_offline_dvd_verify_sle12sp5_ha_alpha_node01_atmg
https://openqa.suse.de/tests/9918083#step/migrate_clvmd_to_lvmlockd/1

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 44 days if nothing changes in this ticket.

#57

Updated by rbranco over 2 years ago

Assignee changed from rbranco to fgerling

#58

Updated by openqa_review over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: migration_offline_dvd_verify_sle15sp2_ltss_ha_alpha_node02
https://openqa.suse.de/tests/10027954#step/check_after_reboot/1

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

#59

Updated by vsvecova over 2 years ago

Hello @fgerling, I'm wondering whether there is any update about this issue? It has been blocking the autoapproval for quite a few maintenance updates lately.

#60

Updated by fgerling over 2 years ago

Status changed from In Progress to Feedback

The last comments from Alvaro and Ricardo mention that this is not a SAP-specific issue. Is there a process to hand it over?
I requested feedback from the PO regarding priority and will update here when I get an answer.

#61

Updated by vsvecova over 2 years ago

I'm not sure how you define specific, but I don't recall seeing this issue anywhere other than in HA-related jobs. I'm not aware of any ticket hand-over process; I guess the rule of thumb has always been that the squad whose tests are failing is also responsible for the fix. In any case, I'm wondering about the usefulness of tests that fail so often. Wouldn't it make more sense to just unschedule them?

#62

Updated by LMartin over 2 years ago

Assignee changed from fgerling to bschmidt

This ticket describes SUT reboots which are unexpected and sporadic, e.g. #note-31 and #note-32 from a year ago. Indeed, it was so sporadic back then that it was hard to find a reproducer.
However looking at the recent fails from 15 SP5 in this ticket ( https://openqa.suse.de/tests/10027954#next_previous and https://openqa.suse.de/tests/10027954#next_previous ) those new failures are very frequent. So those two cases are either the reproducer which has been asked for in this ticket, or an actual product bug which needs attention.

Miura and Birger: can you please check and give feedback here if https://openqa.suse.de/tests/10027954#next_previous and https://openqa.suse.de/tests/10027954#next_previous are real issues or the sporadic unexpected reboots described here.

Regarding autoapprovals of maintenance updates, I have asked Ednilson Miura and Birger Schmidt from QE-SAP to keep an extra eye on http://dashboard.qam.suse.de/blocked to make sure SAP/HA tests are not unnecessarily blocking updates due to these sporadic SUT reboots. As a workaround, those tests should be retriggered and assessed to see whether they are real issues or these sporadic reboots. If you see QE-SAP blocking maintenance updates, please feel free to reach out and ask for the time being.

For a longer-term solution I need to verify with Alvaro when he returns from vacation. I need to understand whether these failures have something in common, e.g. whether they always involve migration, only some specific HA test(s), or both.
In 2023 we can for sure migrate the tests to new workers, but based on what I read above, the resolution is probably not that simple.

And no, unscheduling tests does not make sense in my view. Fixing broken tests makes absolute sense though.

#63

Updated by openqa_review over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: migration_offline_dvd_verify_sle15sp2_ltss_ha_alpha_node02
https://openqa.suse.de/tests/10218340#step/check_after_reboot/1

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

#64

Updated by tinita over 2 years ago

I can see that grep sometimes times out with the regex in the title.
Since (?s) is used, every .* can span multiple lines, which involves a lot of backtracking and might not be needed. Please consider changing some of the .* to [^\n]*, or dropping the (?s) and changing the .* that has to span lines to [\S\s]*.
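To illustrate the second suggestion, here is a small sketch using Python's `re` module, which may differ in details from the engine that auto_review uses. Which `.*` really has to span lines is an assumption on my side: here the one inside `command.*timed out` is kept on a single line and the other two become `[\S\s]*`. Both log snippets are made up for the example.

```python
import re

# Regex from the ticket title: with (?s) every .* may cross line breaks.
ORIGINAL = re.compile(
    r"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out"
)

# Assumed variant: no (?s); only the explicit [\S\s]* may cross line breaks,
# so "command ... timed out" has to sit on one line.
RESTRICTED = re.compile(
    r"tests/ha[\S\s]*(command.*timed out|Test died)[\S\s]*match=root-console timed out"
)

# Hypothetical log excerpt: the command and its timeout share one line.
SAMPLE_LOG = (
    "tests/ha/check_after_reboot.pm:42 called testapi::assert_script_run\n"
    "command 'crm status' timed out\n"
    "match=root-console timed out after 60\n"
)

# Hypothetical log excerpt: "command" and "timed out" are on different lines.
SPLIT_LOG = (
    "tests/ha/check_after_reboot.pm:42\n"
    "command 'crm status'\n"
    "wait_serial timed out\n"
    "match=root-console timed out after 60\n"
)

for name, log in (("sample", SAMPLE_LOG), ("split", SPLIT_LOG)):
    print(name,
          "original:", bool(ORIGINAL.search(log)),
          "restricted:", bool(RESTRICTED.search(log)))
```

The restricted pattern still matches the first excerpt but no longer matches the second, so the variant is only safe if the real logs always keep the command and its timeout on one line.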

#65

Updated by openqa_review about 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: migration_offline_scc_verify_sle12sp4_ltss_ha_alpha_node02
https://openqa.suse.de/tests/10562975#step/migrate_clvmd_to_lvmlockd/1

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 80 days if nothing changes in this ticket.

#66

Updated by openqa_review about 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: xfstests_xfs-xfs-reflink
https://openqa.suse.de/tests/11162974#step/run/1

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 160 days if nothing changes in this ticket.

#67

Updated by acarvajal over 1 year ago

Assignee changed from bschmidt to acarvajal

#68

Updated by acarvajal over 1 year ago

Category changed from Bugs in existing tests to Infrastructure
Status changed from Feedback to Closed

Closing this until issue is seen again to reference more recent jobs.

#69

Updated by openqa_review over 1 year ago

Status changed from Closed to Feedback

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_ctdb_node01
https://openqa.suse.de/tests/12857331#step/iscsi_client/1

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 40 days if nothing changes in this ticket.

#70

Updated by openqa_review about 1 year ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: migration_offline_scc_verify_sle15sp3_ha_alpha_node02
https://openqa.suse.de/tests/14373096#step/check_after_reboot/1

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 316 days if nothing changes in this ticket.

#71

Updated by openqa_review about 1 month ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: sles4sap_hana_node02@ppc64le-sap-qam
https://openqa.suse.de/tests/15040464#step/fencing/1

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 688 days if nothing changes in this ticket.
