action #95788

open

[qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network

Added by okurz about 3 years ago. Updated 7 months ago.

Status: Feedback
Priority: Normal
Assignee: -
Category: Bugs in existing tests
Target version: -
Start date: 2021-07-21
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

openQA test in scenario sle-15-SP2-Server-DVD-HA-Incidents-x86_64-qam_ha_priority_fencing_node01@64bit fails in iscsi_client or other modules, being unable to resolve hostnames via DNS.

Reproducible

Fails sporadically.

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label ,
call openqa-query-for-job-label poo#95788

Expected result

Ability to resolve download.suse.com and other hosts within curl or zypper calls
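
A minimal sketch of how this expectation could be checked from a script, assuming Python 3 is available where the check runs. The helper name `can_resolve` and its injectable `resolver` parameter are illustrative only and are not part of the openQA test code:

```python
import socket


def can_resolve(host, resolver=socket.getaddrinfo):
    """Return True if `host` resolves to at least one address."""
    try:
        # The port is None because only name resolution matters here.
        return len(resolver(host, None)) > 0
    except socket.gaierror:
        # Raised on resolution failures such as "Name or service not known".
        return False
```

Calling `can_resolve("download.suse.com")` or `can_resolve("openqa.suse.de")` on the SUT would then tell apart a DNS problem from a failure in curl or zypper themselves.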

Further details

Always latest result in this scenario: latest

Suggestions

  • Someone from Tools and from SAP pair up to debug this
  • Ask maintainer (Loic)
  • get familiar with iscsi (storage over TCP)

Related issues 5 (1 open, 4 closed)

Related to openQA Tests - action #95458: [qe-sap][ha] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry (Feedback, acarvajal, 2021-07-13)

Related to openQA Tests - action #95801: [qe-sap][ha][css][shap] test fails in register_system of multi-machine HA tests, failing to access network (Rejected, 2021-07-21)

Related to openQA Project - coordination #96185: [epic] Multimachine failure rate increased (Resolved, okurz, 2021-07-29)

Related to openQA Project - action #69976: Show dependency graph for cloned jobs (Resolved, mkittler, 2020-08-13)

Related to openQA Project - action #154552: [ppc64le] test fails in iscsi_client - zypper reports Error Message: Could not resolve host: openqa.suse.de (Resolved, mkittler, 2024-01-30)

Actions #1

Updated by okurz about 3 years ago

  • Related to action #95458: [qe-sap][ha] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry added
Actions #2

Updated by okurz about 3 years ago

  • Description updated (diff)
Actions #3

Updated by okurz about 3 years ago

  • Subject changed from [ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*command.*curl":retry to [ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*command.*curl.*failed":retry
Actions #4

Updated by okurz about 3 years ago

  • Subject changed from [ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*command.*curl.*failed":retry to [ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*post_fail_hook failed: command[^$]*curlcommand.*curl":retry
Actions #5

Updated by okurz about 3 years ago

  • Subject changed from [ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*post_fail_hook failed: command[^$]*curlcommand.*curl":retry to [ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*post_fail_hook failed: command[^$].*curl":retry
Actions #6

Updated by okurz about 3 years ago

  • Subject changed from [ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*post_fail_hook failed: command[^$].*curl":retry to [ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*post_fail_hook failed: command.*curl":retry
Actions #7

Updated by okurz about 3 years ago

Updated the auto-review regex to still match on https://openqa.suse.de/tests/6491214/ but not on https://openqa.suse.de/tests/6426081 which is about #95458 instead
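
A small sketch of the effect of this narrowing, using the first and the current pattern from the subject changes above. The two log excerpts and the module file names in them are hypothetical and were not taken from the linked jobs:

```python
import re

# Patterns copied from the subject changes above.
BROAD = re.compile(r'(?s)tests/ha.*command.*curl')
NARROW = re.compile(r'(?s)tests/ha.*post_fail_hook failed: command.*curl')

# Hypothetical excerpts, for illustration only.
network_failure = (
    "Test died in tests/ha/check_logs.pm\n"
    "post_fail_hook failed: command 'curl --form upload=@/var/log/zypper.log' failed"
)
console_timeout = (
    "Test died in tests/ha/fencing.pm: command 'reboot' timed out\n"
    "later line mentioning curl; match=root-console timed out"
)


def matches(pattern, text):
    """True if the auto-review style pattern is found anywhere in the text."""
    return pattern.search(text) is not None


for name, text in (("network", network_failure), ("timeout", console_timeout)):
    print(name, "broad:", matches(BROAD, text), "narrow:", matches(NARROW, text))
```

With these samples the broad pattern matches both excerpts, while the narrow one only matches the excerpt where the post_fail_hook itself failed on a curl command.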

Actions #8

Updated by okurz about 3 years ago

  • Related to action #95801: [qe-sap][ha][css][shap] test fails in register_system of multi-machine HA tests, failing to access network added
Actions #9

Updated by acarvajal about 3 years ago

While reviewing SLES+HA and SLES for SAP Applications QR build 188.13 results, I ran into several jobs that failed with this issue.

All jobs have been tagged with this poo#. These are:

Between round brackets: architecture - module that failed - reason.

Actions #10

Updated by okurz about 3 years ago

Actions #11

Updated by openqa_review about 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_hawk_haproxy_node02
https://openqa.suse.de/tests/6642090

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The label in the openQA scenario is removed
Actions #12

Updated by acarvajal about 3 years ago

  • Subject changed from [ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*post_fail_hook failed: command.*curl":retry to [ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)":retry
Actions #13

Updated by acarvajal about 3 years ago

Actions #14

Updated by openqa_review about 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: xfstests_btrfs-generic-401-999
https://openqa.suse.de/tests/6982589

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The label in the openQA scenario is removed
Actions #15

Updated by okurz almost 3 years ago

  • Subject changed from [ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)":retry to [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)":retry

Using keyword "qe-sap" as verified by jmichel in weekly QE sync 2021-09-15

Actions #16

Updated by openqa_review almost 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: select_modules_and_patterns
https://openqa.suse.de/tests/7261466

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234
Actions #17

Updated by openqa_review almost 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: xfstests_btrfs-btrfs-151-999
https://openqa.suse.de/tests/7362281

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234
Actions #18

Updated by okurz almost 3 years ago

Please see my proposal to remove the failing test modules from the schedule until the issue could be resolved:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/13475

Actions #19

Updated by okurz almost 3 years ago

  • Subject changed from [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)":retry to [tools][qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)":retry
  • Due date set to 2021-10-20
  • Priority changed from Normal to Urgent
  • Target version set to Ready

As proposed by vpelcak, we are looking for a volunteer from both SUSE QE Tools and QE SAP to collaborate and fix this by next Tue EOB; otherwise the corresponding tests should be disabled, e.g. as proposed in https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/13475

Actions #20

Updated by livdywan almost 3 years ago

  • Description updated (diff)
  • Status changed from New to Workable
  • Assignee set to livdywan

I'm taking a look, and will query who might help as a domain expert

Actions #21

Updated by livdywan almost 3 years ago

okurz wrote:

Observation

openQA test in scenario sle-15-SP2-Server-DVD-HA-Incidents-x86_64-qam_ha_priority_fencing_node01@64bit fails in iscsi_client

testapi::assert_script_run("curl --form upload=\@/var/log/zypper.log --form upname=iscsi_c"..., 90)

Reproducible

Fails sporadically.

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label ,
call openqa-query-for-job-label poo#95788

This unfortunately gives me some invalid job IDs. A couple that worked, with the corresponding errors:

https://openqa.suse.de/tests/7451408

sulogin: tcgetattr failed: Input/output error

https://openqa.suse.de/tests/7443999

Bad Request (400)
googleapi: Error 400: Precondition check failed., failedPrecondition

Further details

Always latest result in this scenario: latest

Looking at previous results, I just see softfails there.

Actions #22

Updated by acarvajal almost 3 years ago

okurz wrote:

Please see my proposal to remove the failing test modules from the schedule until the issue could be resolved:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/13475

This is a bad idea.

For starters, the PR is changing the contents of schedule/ha/qam/common/qam_ha_rolling_upgrade_migration_node01.yaml but the referenced test uses schedule/ha/qam/common/qam_ha_rolling_upgrade_migration_node01_sle12.yaml instead.

But even if the wrong-schedule issue were addressed in the PR, and while it could have the intended outcome for the job linked in the PR, that very same job is not a standalone job, so commenting out those modules in the appropriate schedule will just move the failure to any of the other jobs in the MM cluster. In short, it would trade a sporadic failure[1] for a permanent one.

If the PR is merged as is (with the wrong schedule), not only will it not impact the linked job, but it would introduce issues where the schedule is currently in use:

  • The schedule is present in https://gitlab.suse.de/acarvajal/qam-openqa-yml/-/blob/master/JobGroups/qam_qu/sle15sp2.yml#L1521 (15-SP2 QU).[2]
  • It's also configured as a setting in the qam_ha_rolling_upgrade_migration_node01 test suite in osd.
  • The qam_ha_rolling_upgrade_migration_node01 is used in the QAM TestRepo job groups for 12-SP3, 12-SP4, 12-SP5, 15-SP1, 15-SP2 and 15-SP3, but in most of those the YAML_SCHEDULE setting is being overwritten via the Job Group configuration. AFAICS, the schedule is however used in 15-SP2 TestRepo job group[3], quite successfully I might add.

On the other hand, if the PR is merged with the right schedule, the linked node 1 job may pass, but I see 2 possible outcomes for the related jobs:

Another thing to consider is that this scenario is testing rolling upgrade, i.e.:

  1. Configure 2 nodes with the HA stack.
  2. Add some resources to the HA cluster.
  3. Stop cluster on node 1
  4. Upgrade node 1 to the next SP version of SLES+HA
  5. Start cluster on node 1
  6. Check HA cluster health
  7. Stop cluster on node 2
  8. Upgrade node 2 to the next SP version of SLES+HA
  9. Start cluster on node 2
  10. Check HA cluster health

By commenting the test modules in the PR, no cluster health is being checked (step 6 in the list above) after node 1 migration. In essence, the job will remain, it may pass (but not the other jobs in the MM setup) but the job itself will not be testing anything relevant. It will of course test cluster setup in a previous version of SLES+HA, which is already covered elsewhere.

Finally, the correct YAML schedule at schedule/ha/qam/common/qam_ha_rolling_upgrade_migration_node01_sle12.yaml is not only used in the 12-SP4 HA Single Incidents job group where the linked test is located. It's also in use in 12-SP3 and 12-SP5:

acarvajal:~/git/qam-openqa-yml [master|✔] > grep -r schedule/ha/qam/common/qam_ha_rolling_upgrade_migration_node01_sle12.yaml
JobGroups/test_repo/sle12sp3.yml:          YAML_SCHEDULE: 'schedule/ha/qam/common/qam_ha_rolling_upgrade_migration_node01_sle12.yaml'
JobGroups/test_repo/sle12sp4.yml:          YAML_SCHEDULE: 'schedule/ha/qam/common/qam_ha_rolling_upgrade_migration_node01_sle12.yaml'
JobGroups/test_repo/sle12sp5.yml:          YAML_SCHEDULE: 'schedule/ha/qam/common/qam_ha_rolling_upgrade_migration_node01_sle12.yaml'
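
The same lookup can be scripted when the result should be processed further. This is a rough sketch only; the function name `files_referencing` is made up for illustration, and it does a plain substring search like the `grep -r` call above:

```python
from pathlib import Path


def files_referencing(root, needle):
    """Return sorted relative paths of files under `root` that contain `needle`."""
    hits = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Replace undecodable bytes so binary files do not abort the scan.
            text = path.read_text(encoding="utf-8", errors="replace")
            if needle in text:
                hits.append(path.relative_to(root).as_posix())
    return sorted(hits)
```

Run from a checkout of the job group repository, `files_referencing("JobGroups", "qam_ha_rolling_upgrade_migration_node01_sle12.yaml")` would list every job group file that a schedule change could affect.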

In 12-SP5 the test has a success rate of ca. 89%[4] at the time of this writing, while in 12-SP3 it has a success rate of 82%[5].

In conclusion, removing/commenting out test modules from the schedule will break working tests in the 12-SP3 and 12-SP5 QAM TestRepo job groups, while in 12-SP4 it will simply move a sporadic failure in node 1 to a permanent failure in node 2 or the support server.

If decision is to remove the test, better to do it by commenting related node 1, node 2 and supportserver jobs in https://gitlab.suse.de/qa-maintenance/qam-openqa-yml/-/blob/master/JobGroups/test_repo/sle12sp4.yml

This is what was done in 15-SP1 due to bsc#1183744: https://gitlab.suse.de/acarvajal/qam-openqa-yml/-/blob/master/JobGroups/test_repo/sle15sp1.yml#L316-337

[1] Yes! 70% is still sporadic. It means the test is passing 30% of the time.
[2] https://openqa.suse.de/tests/7288981#next_previous ... 75% success rate, but only 4 jobs.
[3] https://openqa.suse.de/tests/7451366#next_previous ... over 90% success rate.
[4] https://openqa.suse.de/tests/7451334#next_previous
[5] https://openqa.suse.de/tests/7451314#next_previous

Actions #23

Updated by okurz almost 3 years ago

If decision is to remove the test, better to do it by commenting related node 1, node 2 and supportserver jobs in https://gitlab.suse.de/qa-maintenance/qam-openqa-yml/-/blob/master/JobGroups/test_repo/sle12sp4.yml

Sounds good. So if no fix for the underlying issues could be found until EOB tomorrow then this schedule change should be ready to be merged at that time.

Actions #24

Updated by livdywan almost 3 years ago

Ricardo kindly helped me understand a little better what's happening here, and I took some notes on what questions came up:

  • https://openqa.suse.de/tests/7221118#settings
    • There's no Dependencies tab here. I would expect to see node02 and support-server which can be seen e.g. on https://openqa.suse.de/tests/7224107#dependencies (which passed).
    • Node qam-node02 (...) UNCLEAN (offline) stands out as the most relevant error output
    • I don't know what unclean means or how the test tries to access qam-node02 and how it fails
    • This seems to originate in crm_mon -R -r -n -N -1 | grep -i 'no inactive resources'
    • A successful run seems to include an Inactive Resources: section
    • Trying crm_mon -R -r -n -N -1 on a cluster provided by Ricardo seems to have things like * rsc_ip_PRD_HDB00_start_0 on hana02 'error' (1): call=35, status='Timed Out', exitreason='', last-rc-change='2021-09-19 17:51:05 +02:00', queued=0ms, exec=20001ms, where 'error' (1): call=40, status='Timed Out' stands out to me as an error
    • TIMEOUT_SCALE 3 in job settings should mean 50 seconds times 3 meaning 150s for this job. Might make sense to increase the factor?
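
The arithmetic in the last point, written out as a tiny sketch. It assumes the scale is a plain multiplier on the base timeout, as described in the note; how the test backend applies it in detail was not verified here:

```python
def scaled_timeout(base_seconds, timeout_scale):
    """Effective timeout after applying a TIMEOUT_SCALE-style multiplier."""
    return base_seconds * timeout_scale


# Values from the note above: 50 s base timeout and TIMEOUT_SCALE=3.
print(scaled_timeout(50, 3))  # 150
```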

@acarvajal maybe you or somebody else can comment on the points above? In particular what UNCLEAN means and how the test checks it and why there's no error output there e.g. timed out or unreachable

Actions #25

Updated by okurz almost 3 years ago

cdywan wrote:

Ricardo kindly helped me understand a little better what's happening here, and I took some notes on what questions came up:

Jobs which are cloned cannot consistently resolve dependencies, hence this won't show there. There is a feature request about this, but I couldn't find the ticket right now.

Actions #26

Updated by livdywan almost 3 years ago

acarvajal wrote:

If decision is to remove the test, better to do it by commenting related node 1, node 2 and supportserver jobs in https://gitlab.suse.de/qa-maintenance/qam-openqa-yml/-/blob/master/JobGroups/test_repo/sle12sp4.yml

https://gitlab.suse.de/qa-maintenance/qam-openqa-yml/-/merge_requests/193

Here's my attempt, following your suggestion, in case we won't have a fix by EOD.

Actions #27

Updated by livdywan almost 3 years ago

cdywan wrote:

  • Node qam-node02 (...) UNCLEAN (offline) stands out as the most relevant error output
    • I don't know what unclean means or how the test tries to access qam-node02 and how it fails
    • This seems to originate in crm_mon -R -r -n -N -1 | grep -i 'no inactive resources'
    • A successful run seems to include an Inactive Resources: section
    • Trying crm_mon -R -r -n -N -1 on a cluster provided by Ricardo seems to have things like * rsc_ip_PRD_HDB00_start_0 on hana02 'error' (1): call=35, status='Timed Out', exitreason='', last-rc-change='2021-09-19 17:51:05 +02:00', queued=0ms, exec=20001ms, where 'error' (1): call=40, status='Timed Out' stands out to me as an error

Btw this is in lib/hacluster.pm in check_cluster_state which conditionally greps for 'no inactive resources'. And I notice the crm_verify -LV is also conditionally fatal. Maybe this should not fail the test? I don't understand why it's fatal only in some cases, though, so this may be totally wrong.

  • TIMEOUT_SCALE 3 in job settings should mean 50 seconds times 3 meaning 150s for this job. Might make sense to increase the factor?

I couldn't actually find where this is set. I can only see it in the yaml for other tests.

Actions #28

Updated by acarvajal almost 3 years ago

cdywan wrote:

Ricardo kindly helped me understand a little better what's happening here, and I took some notes on what questions came up:

Agree. Taken from the short description above for the test:

  1. Configure 2 nodes with the HA stack.
  2. Add some resources to the HA cluster.
  3. Stop cluster on node 1
  4. Upgrade node 1 to the next SP version of SLES+HA
  5. Start cluster on node 1
  6. Check HA cluster health
  7. Stop cluster on node 2
  8. Upgrade node 2 to the next SP version of SLES+HA
  9. Start cluster on node 2
  10. Check HA cluster health

It seems this is happening during step 6, i.e., node 1 has just been migrated to the next SP, cluster has been restarted on that node (it would start automatically after the reboot), but then it's finding the other node unhealthy/unclean.

As to the root cause, I would think either a product bug, a communication issue between both nodes or some race condition.

Not sure increasing the timeout would help as node 2 should always be available during node 1 migration.

- I don't know what unclean means or how the test tries to access qam-node02 and how it fails
- This seems to originate in `crm_mon -R -r -n -N -1 | grep -i 'no inactive resources'`

Node 2 is unclean, there are inactive resources, so the test fails.

- A successful run seems to include an `Inactive Resources:` section

The failing test also includes the section. As you can see, it lists a lot of inactive resources there: https://openqa.suse.de/tests/7221118#step/check_cluster_integrity/6

What a successful test should include is an empty Inactive Resources: section.
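
To make the check concrete, here is a rough sketch that mimics the `grep -i 'no inactive resources'` step on captured `crm_mon` output. The function name is made up, the two sample outputs are simplified and hypothetical, and the extra test for UNCLEAN nodes is an addition for illustration rather than something taken from lib/hacluster.pm:

```python
import re


def cluster_looks_healthy(crm_mon_output):
    """Mimic grep -i 'no inactive resources' and also reject UNCLEAN nodes."""
    has_marker = re.search(r"no inactive resources", crm_mon_output, re.IGNORECASE) is not None
    has_unclean = "UNCLEAN" in crm_mon_output
    return has_marker and not has_unclean


# Simplified, hypothetical excerpts; real crm_mon output contains more sections.
healthy_sample = """Node List:
  * Node qam-node01: online
  * Node qam-node02: online

Inactive Resources:
  * No inactive resources
"""

failing_sample = """Node List:
  * Node qam-node01: online
  * Node qam-node02: UNCLEAN (offline)

Inactive Resources:
  * stonith-sbd (stonith:external/sbd): Stopped
"""

print(cluster_looks_healthy(healthy_sample))
print(cluster_looks_healthy(failing_sample))
```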

- Trying `crm_mon -R -r -n -N -1` on a cluster provided by Ricardo seems to have things like `* rsc_ip_PRD_HDB00_start_0 on hana02 'error' (1): call=35, status='Timed Out', exitreason='', last-rc-change='2021-09-19 17:51:05 +02:00', queued=0ms, exec=20001ms`, where `'error' (1): call=40, status='Timed Out'` stands out to me as an error

Different type of cluster/scenario. That error is seen on HANA clusters after a site takeover/takeback. You can see it in successful test for example at: https://openqa.suse.de/tests/7430859#step/check_after_reboot#1/15

Test modules handle that error (registers fenced HANA node for system replication again in the cluster) and test continues.

This scenario (rolling upgrade) is not using HANA.

  • TIMEOUT_SCALE 3 in job settings should mean 50 seconds times 3 meaning 150s for this job. Might make sense to increase the factor?

I think it can be tested with an increased timeout just to confirm whether it helps or not, but my hunch is that it will not help.

Actions #29

Updated by livdywan almost 3 years ago

  • Status changed from Workable to Feedback

cdywan wrote:

https://gitlab.suse.de/qa-maintenance/qam-openqa-yml/-/merge_requests/193

Here's my attempt, following your suggestion, in case we won't have a fix by EOD.

The above MR was reviewed and merged.

There was a suggestion in chat to have the test in a development group. Due to the concerns over breaking other tests I've not tried that.

I'm wondering whether we want to continue the investigation of the failures in this ticket or in a new one.

Actions #30

Updated by okurz almost 3 years ago

  • Due date deleted (2021-10-20)
  • Status changed from Feedback to Workable
  • Assignee deleted (livdywan)
  • Priority changed from Urgent to High
  • Target version deleted (Ready)

Better to continue here. But I think at this point it's better for QE SAP to decide how to go on, what to cover manually, what to fix in tests, where to test it, etc. @cdywan thanks for your help. Removing you as assignee and reducing prio after the urgent issue was addressed.

Actions #31

Updated by livdywan almost 3 years ago

  • Related to action #69976: Show dependency graph for cloned jobs added
Actions #32

Updated by okurz almost 3 years ago

  • Subject changed from [tools][qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)":retry to [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)":retry
Actions #33

Updated by okurz almost 3 years ago

  • Description updated (diff)
Actions #34

Updated by openqa_review almost 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: jeos-extratest
https://openqa.suse.de/tests/7350856

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234
Actions #35

Updated by openqa_review almost 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: toolchain_zypper
https://openqa.suse.de/tests/7728612

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234
Actions #36

Updated by openqa_review over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: jeos-base+phub
https://openqa.suse.de/tests/7802298

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234
Actions #37

Updated by asmorodskyi over 2 years ago

  • Subject changed from [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)":retry to [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)(tests/ha_cluster_join|tests/iscsi_client).*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)":retry
Actions #38

Updated by okurz over 2 years ago

  • Subject changed from [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)(tests/ha_cluster_join|tests/iscsi_client).*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)":retry to [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)(tests/ha/ha_cluster_join|tests/iscsi/iscsi_client).*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)":retry

paths like "tests/ha_cluster_join|tests/iscsi_client" don't exist. In os-autoinst-distri-opensuse there are paths like "tests/ha/ha_cluster_join.pm" and "tests/iscsi/iscsi_client.pm"
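
A quick sketch that shows the difference between the two variants. Both patterns are copied from the subject change above; the sample line is hypothetical and only uses the module path named in this comment:

```python
import re

SUFFIX = r".*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)"
OLD = re.compile(r"(?s)(tests/ha_cluster_join|tests/iscsi_client)" + SUFFIX)
NEW = re.compile(r"(?s)(tests/ha/ha_cluster_join|tests/iscsi/iscsi_client)" + SUFFIX)

# Hypothetical log line, for illustration only.
sample = (
    "Test died in tests/iscsi/iscsi_client.pm\n"
    "post_fail_hook failed: command 'curl --form upload=@/var/log/zypper.log' failed"
)

print("old pattern matches:", OLD.search(sample) is not None)
print("new pattern matches:", NEW.search(sample) is not None)
```

The old alternatives lack the subdirectory, so they can never be found in a path such as tests/iscsi/iscsi_client.pm.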

Actions #39

Updated by openqa_review over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: toolchain_zypper
https://openqa.suse.de/tests/7925816

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234
Actions #40

Updated by openqa_review over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: toolchain_zypper
https://openqa.suse.de/tests/7976119

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234
Actions #41

Updated by openqa_review over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_gamma_node03
https://openqa.suse.de/tests/8044170

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234
Actions #42

Updated by asmorodskyi over 2 years ago

  • Subject changed from [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)(tests/ha/ha_cluster_join|tests/iscsi/iscsi_client).*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)":retry to [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network

removing autoreview due to false labeling https://openqa.suse.de/tests/8109535#comments

Actions #43

Updated by openqa_review over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: rsync-client
https://openqa.suse.de/tests/8197348

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234
Actions #44

Updated by rbranco over 2 years ago

  • Status changed from Workable to Resolved
Actions #45

Updated by okurz over 2 years ago

I am not sure if this will stay true. For example the latest job in https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Online&machine=64bit&test=rsync-client&version=15-SP4#next_previous , related to the last linked job in comments, was on 2022-03-08, it passed, but this was a sporadic issue. And no test was conducted since then. I wouldn't be so sure this can't happen again but I am crossing fingers as well :)

Actions #46

Updated by openqa_review over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_delta_node02
https://openqa.suse.de/tests/8445292#step/ha_cluster_join/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #47

Updated by openqa_review over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_2nodes_02
https://openqa.suse.de/tests/8630040#step/ha_cluster_join/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 60 days if nothing changes in this ticket.

Actions #49

Updated by openqa_review over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_2nodes_02
https://openqa.suse.de/tests/8706249#step/ha_cluster_join/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #50

Updated by openqa_review over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_2nodes_02
https://openqa.suse.de/tests/8915222#step/ha_cluster_join/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #51

Updated by openqa_review about 2 years ago

  • Status changed from Resolved to Feedback

Re-opening tickets with unhandled openqa-review reminder comment, see https://progress.opensuse.org/projects/openqatests/wiki/Wiki#openqa-review-reminder-handling

Actions #52

Updated by szarate about 2 years ago

  • Priority changed from High to Normal

They aren't high prio if nobody looks at them, perhaps the soft failure should be changed to: label:wontfix:xxxx

Actions #53

Updated by llzhao 7 months ago

  • Status changed from Feedback to Workable
Actions #54

Updated by acarvajal 7 months ago

llzhao wrote in #note-53:

Reopen it as there are some occurrences in OSD:
https://openqa.suse.de/tests/13380296#step/iscsi_client/9
https://openqa.suse.de/tests/13380301#step/iscsi_client/9

We're observing this only on ppc64le and only in SLES for SAP jobs. HA jobs on ppc64le do not have the issue, so it could possibly be related to the qemu_ppc64le-large-mem workers.

Actions #55

Updated by acarvajal 7 months ago

https://openqa.suse.de/tests/13381522#step/iscsi_client/9
https://openqa.suse.de/tests/13381519#step/iscsi_client/9

It seems the cluster nodes ran on petrol and the support servers ran on mania ... and the error is in resolving openqa.suse.de. This could be an MM connection issue between the cluster nodes and the support server.

Actions #56

Updated by acarvajal 7 months ago

  • Related to action #154552: [ppc64le] test fails in iscsi_client - zypper reports Error Message: Could not resolve host: openqa.suse.de added
Actions #57

Updated by acarvajal 7 months ago · Edited

  • Status changed from Workable to Feedback

Closing this again, as the Tools team thinks it's a new issue. Instead filed: #154552

Actions #58

Updated by okurz 7 months ago

"Feedback" is not closed. The ticket was open since openqa_review opened it over a year ago in #95788-51 and it's in the scope of qe-sap as visible in the subject.
