action #95788: [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network - openQA Tests (public) - openSUSE Project Management Tool

Actions

Copy link

action #95788

open

[qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network

Added by okurz almost 4 years ago. Updated 4 months ago.

Status:

Feedback

Priority:

Normal

Assignee:

Category:

Bugs in existing tests

Target version:

Start date:

2021-07-21

Due date:

% Done:

Estimated time:

Difficulty:

Description

Observation¶

openQA test in scenario sle-15-SP2-Server-DVD-HA-Incidents-x86_64-qam_ha_priority_fencing_node01@64bit fails in
iscsi_client
or other modules being unable to resolve DNS.

Reproducible¶

Fails sporadically.

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label ,
call openqa-query-for-job-label poo#95788

Expected result¶

Ability to resolve download.suse.com and other hosts within curl or zypper calls

Further details¶

Always latest result in this scenario: latest

Suggestions¶

Someone from Tools and from SAP pair up to debug this
~~Ask maintainer (Loic)~~
get familiar with iscsi (storage over TCP)

Related issues 5 (1 open — 4 closed)

Related to openQA Tests (public) - action #95458: [qe-sap][ha] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry

Feedback

acarvajal

2021-07-13

Actions

Related to openQA Tests (public) - action #95801: [qe-sap][ha][css][shap] test fails in register_system of multi-machine HA tests, failing to access network

Rejected

2021-07-21

Actions

Related to openQA Project (public) - coordination #96185: [epic] Multimachine failure rate increased

Resolved

okurz

2021-07-29

Actions

Related to openQA Project (public) - action #69976: Show dependency graph for cloned jobs

Resolved

mkittler

2020-08-13

Actions

Related to openQA Project (public) - action #154552: [ppc64le] test fails in iscsi_client - zypper reports Error Message: Could not resolve host: openqa.suse.de

Resolved

mkittler

2024-01-30

Actions

Copy link

Updated by okurz almost 4 years ago

Related to action #95458: [qe-sap][ha] SUT reboots unexpectedly, leading to tests failing in HA scenarios auto_review:"(?s)tests/ha.*(command.*timed out|Test died).*match=root-console timed out":retry added

Actions

Copy link

Updated by okurz almost 4 years ago

Description updated (diff)

Actions

Copy link

Updated by okurz almost 4 years ago

Subject changed from [ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*command.*curl":retry to [ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*command.*curl.*failed":retry

Actions

Copy link

Updated by okurz almost 4 years ago

Subject changed from [ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*command.*curl.*failed":retry to [ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*post_fail_hook failed: command[^$]*curlcommand.*curl":retry

Actions

Copy link

Updated by okurz almost 4 years ago

Subject changed from [ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*post_fail_hook failed: command[^$]*curlcommand.*curl":retry to [ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*post_fail_hook failed: command[^$].*curl":retry

Actions

Copy link

Updated by okurz almost 4 years ago

Subject changed from [ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*post_fail_hook failed: command[^$].*curl":retry to [ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*post_fail_hook failed: command.*curl":retry

Actions

Copy link

Updated by okurz almost 4 years ago

Updated the auto-review regex to still match on https://openqa.suse.de/tests/6491214/ but not on https://openqa.suse.de/tests/6426081 which is about #95458 instead

Actions

Copy link

Updated by okurz almost 4 years ago

Related to action #95801: [qe-sap][ha][css][shap] test fails in register_system of multi-machine HA tests, failing to access network added

Actions

Copy link

Updated by acarvajal almost 4 years ago

While reviewing SLES+HA and SLES for SAP Applications QR build 188.13 results, I ran into several jobs that failed with this issue.

All jobs have been tagged with this poo#. These are:

Between round brackets: architecture - module that failed - reason.

https://openqa.suse.de/tests/6590829#step/ha_cluster_join/8 (aarch64 - ha/ha_cluster_join - cannot reach the other node with ha-cluster-join)
https://openqa.suse.de/tests/6590692#step/ha_cluster_join/8 (ppc64le - ha/ha_cluster_join - cannot reach the other node with ha-cluster-join)
https://openqa.suse.de/tests/6590752#step/iscsi_client/9 (x86_64 - ha/iscsi_client - cannot resolve updates.suse.com)
https://openqa.suse.de/tests/6590755#step/ha_cluster_join/6 (x86_64 - ha/ha_cluster_join - cannot reach the other node with ping)
https://openqa.suse.de/tests/6590738#step/iscsi_client/9 (ppc64le - ha/iscsi_client - cannot resolve updates.suse.com)

Actions

Copy link

#10

Updated by okurz almost 4 years ago

Related to coordination #96185: [epic] Multimachine failure rate increased added

Actions

Copy link

#11

Updated by openqa_review almost 4 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_hawk_haproxy_node02
https://openqa.suse.de/tests/6642090

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The label in the openQA scenario is removed

Actions

Copy link

#12

Updated by acarvajal almost 4 years ago

Subject changed from [ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*post_fail_hook failed: command.*curl":retry to [ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)":retry

Actions

Copy link

#13

Updated by acarvajal almost 4 years ago

Updated auto_review to cover https://openqa.suse.de/tests/6887791#step/ha_cluster_join/7 as well.

Actions

Copy link

#14

Updated by openqa_review over 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: xfstests_btrfs-generic-401-999
https://openqa.suse.de/tests/6982589

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The label in the openQA scenario is removed

Actions

Copy link

#15

Updated by okurz over 3 years ago

Subject changed from [ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)":retry to [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)":retry

Using keyword "qe-sap" as verified by jmichel in weekly QE sync 2021-09-15

Actions

Copy link

#16

Updated by openqa_review over 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: select_modules_and_patterns
https://openqa.suse.de/tests/7261466

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Actions

Copy link

#17

Updated by openqa_review over 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: xfstests_btrfs-btrfs-151-999
https://openqa.suse.de/tests/7362281

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Actions

Copy link

#18

Updated by okurz over 3 years ago

Please see my proposal to remove the failing test modules from the schedule until the issue could be resolved:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/13475

Actions

Copy link

#19

Updated by okurz over 3 years ago

Subject changed from [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)":retry to [tools][qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)":retry
Due date set to 2021-10-20
Priority changed from Normal to Urgent
Target version set to Ready

As proposed by vpelcak we are looking for a volunteer from both SUSE QE Tools as well as QE SAP to collaborate and fix until next Tue EOB, otherwise the according tests should be disabled, e.g. as proposed in https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/13475

Actions

Copy link

#20

Updated by livdywan over 3 years ago

Description updated (diff)
Status changed from New to Workable
Assignee set to livdywan

I'm taking a look, and will query who might help as a domain expert

Actions

Copy link

#21

Updated by livdywan over 3 years ago

okurz wrote:

Observation¶

openQA test in scenario sle-15-SP2-Server-DVD-HA-Incidents-x86_64-qam_ha_priority_fencing_node01@64bit fails in
iscsi_client

testapi::assert_script_run("curl --form upload=\@/var/log/zypper.log --form upname=iscsi_c"..., 90)

Reproducible¶

Fails sporadically.

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label ,
call openqa-query-for-job-label poo#95788

This unfortunately gives me some invalid job ID's. A couple that worked with the according errors:

https://openqa.suse.de/tests/7451408

sulogin: tcgetattr failed: Input/output error

https://openqa.suse.de/tests/7443999

Bad Request (400)
googleapi: Error 400: Precondition check failed., failedPrecondition

Further details¶

Always latest result in this scenario: latest

Looking at previous results, I just see softfails there.

Actions

Copy link

#22

Updated by acarvajal over 3 years ago

okurz wrote:

Please see my proposal to remove the failing test modules from the schedule until the issue could be resolved:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/13475

This is a bad idea.

For starters, the PR is changing the contents of schedule/ha/qam/common/qam_ha_rolling_upgrade_migration_node01.yaml but the referenced test uses schedule/ha/qam/common/qam_ha_rolling_upgrade_migration_node01_sle12.yaml instead.

But even if the wrong schedule issue was addressed in the PR, and while it could have the intended outcome for the job linked in the PR, that very same job is not a standalone job, so commenting those modules from the appropriate schedule will just move the failure to any of the other jobs in the MM cluster. In short, it would trade an sporadic failure[1], for a permanent one.

If the PR is merged as is (with the wrong schedule), not only will it not impact the linked job, but it would introduce issues where the schedule is currently in use:

The schedule is present in https://gitlab.suse.de/acarvajal/qam-openqa-yml/-/blob/master/JobGroups/qam_qu/sle15sp2.yml#L1521 (15-SP2 QU).[2]
It's also configured as a setting in the qam_ha_rolling_upgrade_migration_node01 test suite in osd.
The qam_ha_rolling_upgrade_migration_node01 is used in the QAM TestRepo job groups for 12-SP3, 12-SP4, 12-SP5, 15-SP1, 15-SP2 and 15-SP3, but in most of those the YAML_SCHEDULE setting is being overwritten via the Job Group configuration. AFAICS, the schedule is however used in 15-SP2 TestRepo job group[3], quite successfully I might add.

On the other hand, if the PR is merged with the right schedule, the linked node 1 job may pass, but I see 2 possible outcomes for the related jobs:

Either node 2 will fail timing out in a lockapi::barrier_wait() call waiting for barrier_wait() calls on the same barrier from the other jobs (for example, in https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/ha/check_hawk.pm#L69)
Support server will time out after reaching MAX_JOB_TIME while waiting for all child jobs to finish; node 1 would finish of course thanks to the PR, but node 2 will stay "forever" in a lockapi::barrier_wait() call.

Another thing to consider is that this scenario is testing rolling upgrade, i.e.:

Configure 2 nodes with the HA stack.
Add some resources to the HA cluster.
Stop cluster on node 1
Upgrade node 1 to the next SP version of SLES+HA
Start cluster on node 1
Check HA cluster health
Stop cluster on node 2
Upgrade node 2 to the next SP version of SLES+HA
Start cluster on node 2
Check HA cluster health

By commenting the test modules in the PR, no cluster health is being checked (step 6 in the list above) after node 1 migration. In essence, the job will remain, it may pass (but not the other jobs in the MM setup) but the job itself will not be testing anything relevant. It will of course test cluster setup in a previous version of SLES+HA, which is already covered elsewhere.

Finally, the correct YAML schedule at schedule/ha/qam/common/qam_ha_rolling_upgrade_migration_node01_sle12.yaml is not only used in the 12-SP4 HA Single Incidents job group where the linked test is located. It's also in use in 12-SP3 and 12-SP5:

acarvajal:~/git/qam-openqa-yml [master|✔] > grep -r schedule/ha/qam/common/qam_ha_rolling_upgrade_migration_node01_sle12.yaml
JobGroups/test_repo/sle12sp3.yml:          YAML_SCHEDULE: 'schedule/ha/qam/common/qam_ha_rolling_upgrade_migration_node01_sle12.yaml'
JobGroups/test_repo/sle12sp4.yml:          YAML_SCHEDULE: 'schedule/ha/qam/common/qam_ha_rolling_upgrade_migration_node01_sle12.yaml'
JobGroups/test_repo/sle12sp5.yml:          YAML_SCHEDULE: 'schedule/ha/qam/common/qam_ha_rolling_upgrade_migration_node01_sle12.yaml'

In 12-SP5 test has ca. 89% success rate[4] at the time of this writing, while in 12-SP3 it has a 82% success rate[5].

In conclusion, removing/commenting test modules from the schedule, will break working tests in 12-SP3 and 12-SP5 QAM TestRepo job groups, while in 12-SP4 it will simply move an sporadic failure in node 1, to a permanent failure in node 2 or the support server.

If decision is to remove the test, better to do it by commenting related node 1, node 2 and supportserver jobs in https://gitlab.suse.de/qa-maintenance/qam-openqa-yml/-/blob/master/JobGroups/test_repo/sle12sp4.yml

This is what was done in 15-SP1 due to bsc#1183744: https://gitlab.suse.de/acarvajal/qam-openqa-yml/-/blob/master/JobGroups/test_repo/sle15sp1.yml#L316-337

[1] Yes! 70% is still sporadic. It means the test is passing 30% of the time.
[2] https://openqa.suse.de/tests/7288981#next_previous ... 75% success rate, but only 4 jobs.
[3] https://openqa.suse.de/tests/7451366#next_previous ... over 90% success rate.
[4] https://openqa.suse.de/tests/7451334#next_previous
[5] https://openqa.suse.de/tests/7451314#next_previous

Actions

Copy link

#23

Updated by okurz over 3 years ago

If decision is to remove the test, better to do it by commenting related node 1, node 2 and supportserver jobs in https://gitlab.suse.de/qa-maintenance/qam-openqa-yml/-/blob/master/JobGroups/test_repo/sle12sp4.yml

Sounds good. So if no fix for the underlying issues could be found until EOB tomorrow then this schedule change should be ready to be merged at that time.

Actions

Copy link

#24

Updated by livdywan over 3 years ago

Ricardo kindly helped me understand a little better what's happening here, and I took some notes on what questions came up:

https://openqa.suse.de/tests/7221118#settings
- There's no Dependencies tab here. I would expect to see node02 and support-server which can be seen e.g. on https://openqa.suse.de/tests/7224107#dependencies (which passed).
- Node qam-node02 (...) UNCLEAN (offline) stands out as the most relevant error output
  - I don't know what unclean means or how the test tries to access qam-node02 and how it fails
  - This seems to originate in crm_mon -R -r -n -N -1 | grep -i 'no inactive resources'
  - A successful run seems to include an Inactive Resources: section
  - Trying crm_mon -R -r -n -N -1 on a cluster provided by Ricardo seems to have things like * rsc_ip_PRD_HDB00_start_0 on hana02 'error' (1): call=35, status='Timed Out', exitreason='', last-rc-change='2021-09-19 17:51:05 +02:00', queued=0ms, exec=20001ms, where 'error' (1): call=40, status='Timed Out' stands out to me as an error
- TIMEOUT_SCALE 3 in job settings should mean 50 seconds times 3 meaning 150s for this job. Might make sense to increase the factor?

@acarvajal maybe you or somebody else can comment on the points above? In particular what UNCLEAN means and how the test checks it and why there's no error output there e.g. timed out or unreachable

Actions

Copy link

#25

Updated by okurz over 3 years ago

cdywan wrote:

Ricardo kindly helped me understand a little better what's happening here, and I took some notes on what questions came up:

https://openqa.suse.de/tests/7221118#settings

There's no Dependencies tab here. I would expect to see node02 and support-server which can be seen e.g. on https://openqa.suse.de/tests/7224107#dependencies (which passed).

jobs which are cloned cannot consistently resolve dependencies hence this won't show there. There is a feature request about this, couldn't find the ticket right now

Actions

Copy link

#26

Updated by livdywan over 3 years ago

acarvajal wrote:

If decision is to remove the test, better to do it by commenting related node 1, node 2 and supportserver jobs in https://gitlab.suse.de/qa-maintenance/qam-openqa-yml/-/blob/master/JobGroups/test_repo/sle12sp4.yml

https://gitlab.suse.de/qa-maintenance/qam-openqa-yml/-/merge_requests/193

Here's my attempt, following your suggestion, in case we won't have a fix by EOD.

Actions

Copy link

#27

Updated by livdywan over 3 years ago

cdywan wrote:

Node qam-node02 (...) UNCLEAN (offline) stands out as the most relevant error output

I don't know what unclean means or how the test tries to access qam-node02 and how it fails

This seems to originate in crm_mon -R -r -n -N -1 | grep -i 'no inactive resources'

A successful run seems to include an Inactive Resources: section

Trying crm_mon -R -r -n -N -1 on a cluster provided by Ricardo seems to have things like * rsc_ip_PRD_HDB00_start_0 on hana02 'error' (1): call=35, status='Timed Out', exitreason='', last-rc-change='2021-09-19 17:51:05 +02:00', queued=0ms, exec=20001ms, where 'error' (1): call=40, status='Timed Out' stands out to me as an error

Btw this is in lib/hacluster.pm in check_cluster_state which conditionally greps for 'no inactive resources'. And I notice the crm_verify -LV is also conditionally fatal. Maybe this should not fail the test? I don't understand why it's fatal only in some cases, though, so this may be totally wrong.

TIMEOUT_SCALE 3 in job settings should mean 50 seconds times 3 meaning 150s for this job. Might make sense to increase the factor?

I couldn't actually find where this is set. I can only see it in the yaml for other tests.

Actions

Copy link

#28

Updated by acarvajal over 3 years ago

cdywan wrote:

Ricardo kindly helped me understand a little better what's happening here, and I took some notes on what questions came up:

https://openqa.suse.de/tests/7221118#settings

There's no Dependencies tab here. I would expect to see node02 and support-server which can be seen e.g. on https://openqa.suse.de/tests/7224107#dependencies (which passed).

Node qam-node02 (...) UNCLEAN (offline) stands out as the most relevant error output

Agree. Taken from the short description above for the test:

Configure 2 nodes with the HA stack.
Add some resources to the HA cluster.
Stop cluster on node 1
Upgrade node 1 to the next SP version of SLES+HA
Start cluster on node 1
Check HA cluster health
Stop cluster on node 2
Upgrade node 2 to the next SP version of SLES+HA
Start cluster on node 2
Check HA cluster health

It seems this is happening during step 6, i.e., node 1 has just been migrated to the next SP, cluster has been restarted on that node (it would start automatically after the reboot), but then it's finding the other node unhealthy/unclean.

As to the root cause, I would think either a product bug, a communication issue between both nodes or some race condition.

Not sure increasing the timeout would help as node 2 should always be available during node 1 migration.

- I don't know what unclean means or how the test tries to access qam-node02 and how it fails
- This seems to originate in `crm_mon -R -r -n -N -1 | grep -i 'no inactive resources'`

Node 2 is unclean, there are inactive resources, so the test fails.

- A successful run seems to include an `Inactive Resources:` section

Failing test also includes the section. If you see it lists a lot of inactive resources there: https://openqa.suse.de/tests/7221118#step/check_cluster_integrity/6

What successful test should include is an empty Inactive Resources: section.

- Trying `crm_mon -R -r -n -N -1` on a cluster provided by Ricardo seems to have things like `* rsc_ip_PRD_HDB00_start_0 on hana02 'error' (1): call=35, status='Timed Out', exitreason='', last-rc-change='2021-09-19 17:51:05 +02:00', queued=0ms, exec=20001ms`, where `'error' (1): call=40, status='Timed Out'` stands out to me as an error

Different type of cluster/scenario. That error is seen on HANA clusters after a site takeover/takeback. You can see it in successful test for example at: https://openqa.suse.de/tests/7430859#step/check_after_reboot#1/15

Test modules handle that error (registers fenced HANA node for system replication again in the cluster) and test continues.

This scenario (rolling upgrade) is not using HANA.

TIMEOUT_SCALE 3 in job settings should mean 50 seconds times 3 meaning 150s for this job. Might make sense to increase the factor?

I think it can be tested with an increased timeout just to confirm whether it helps or not, but my hunch is that it will not help.

Actions

Copy link

#29

Updated by livdywan over 3 years ago

Status changed from Workable to Feedback

cdywan wrote:

https://gitlab.suse.de/qa-maintenance/qam-openqa-yml/-/merge_requests/193

Here's my attempt, following your suggestion, in case we won't have a fix by EOD.

The above MR was reviewed and merged.

There was a suggestion in chat to have the test in a development group. Due the concerns over breaking other tests I've not tried that.

I'm thinking if we want this or a new ticket to continue the investigation of the failures.

Actions

Copy link

#30

Updated by okurz over 3 years ago

Due date deleted (~~2021-10-20~~)
Status changed from Feedback to Workable
Assignee deleted (~~livdywan~~)
Priority changed from Urgent to High
Target version deleted (~~Ready~~)

better continue here. But I think at this point it's better for QE SAP to decide how to go on, what to cover manually, what to fix in tests, where to test it, etc. @cdywan thanks for your help. Removing you from assignee and reducing prio after the urgent issue was addressed.

Actions

Copy link

#31

Updated by livdywan over 3 years ago

Related to action #69976: Show dependency graph for cloned jobs added

Actions

Copy link

#32

Updated by okurz over 3 years ago

Subject changed from [tools][qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)":retry to [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)":retry

Actions

Copy link

#33

Updated by okurz over 3 years ago

Description updated (diff)

Actions

Copy link

#34

Updated by openqa_review over 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: jeos-extratest
https://openqa.suse.de/tests/7350856

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Actions

Copy link

#35

Updated by openqa_review over 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: toolchain_zypper
https://openqa.suse.de/tests/7728612

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Actions

Copy link

#36

Updated by openqa_review over 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: jeos-base+phub
https://openqa.suse.de/tests/7802298

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Actions

Copy link

#37

Updated by asmorodskyi over 3 years ago

Subject changed from [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)":retry to [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)(tests/ha_cluster_join|tests/iscsi_client).*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)":retry

Actions

Copy link

#38

Updated by okurz over 3 years ago

Subject changed from [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)(tests/ha_cluster_join|tests/iscsi_client).*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)":retry to [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)(tests/ha/ha_cluster_join|tests/iscsi/iscsi_client).*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)":retry

paths like "tests/ha_cluster_join|tests/iscsi_client" don't exist. In os-autoinst-distri-opensuse there are paths like "tests/ha/ha_cluster_join.pm" and "tests/iscsi/iscsi_client.pm"

Actions

Copy link

#39

Updated by openqa_review over 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: toolchain_zypper
https://openqa.suse.de/tests/7925816

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Actions

Copy link

#40

Updated by openqa_review over 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: toolchain_zypper
https://openqa.suse.de/tests/7976119

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Actions

Copy link

#41

Updated by openqa_review over 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_gamma_node03
https://openqa.suse.de/tests/8044170

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Actions

Copy link

#42

Updated by asmorodskyi over 3 years ago

Subject changed from [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)(tests/ha/ha_cluster_join|tests/iscsi/iscsi_client).*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)":retry to [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network

removing autoreview due to false labeling https://openqa.suse.de/tests/8109535#comments

Actions

Copy link

#43

Updated by openqa_review over 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: rsync-client
https://openqa.suse.de/tests/8197348

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Actions

Copy link

#44

Updated by rbranco about 3 years ago

Status changed from Workable to Resolved

This ticket contains totally unrelated tests.

From the HA/SAP side we could fix with:

https://gitlab.suse.de/qa-maintenance/qam-openqa-yml/-/merge_requests/231
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14457

Actions

Copy link

#45

Updated by okurz about 3 years ago

I am not sure if this will stay true. For example the latest job in https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Online&machine=64bit&test=rsync-client&version=15-SP4#next_previous , related to the last linked job in comments, was on 2022-03-08, it passed, but this was a sporadic issue. And no test was conducted since then. I wouldn't be so sure this can't happen again but I am crossing fingers as well :)

Actions

Copy link

#46

Updated by openqa_review about 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_delta_node02
https://openqa.suse.de/tests/8445292#step/ha_cluster_join/1

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions

Copy link

#47

Updated by openqa_review about 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_2nodes_02
https://openqa.suse.de/tests/8630040#step/ha_cluster_join/1

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 60 days if nothing changes in this ticket.

Actions

Copy link

#49

Updated by openqa_review about 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_2nodes_02
https://openqa.suse.de/tests/8706249#step/ha_cluster_join/1

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions

Copy link

#50

Updated by openqa_review almost 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam_2nodes_02
https://openqa.suse.de/tests/8915222#step/ha_cluster_join/1

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions

Copy link

#51

Updated by openqa_review almost 3 years ago

Status changed from Resolved to Feedback

Re-opening tickets with unhandled openqa-review reminder comment, see https://progress.opensuse.org/projects/openqatests/wiki/Wiki#openqa-review-reminder-handling

Actions

Copy link

#52

Updated by szarate almost 3 years ago

Priority changed from High to Normal

They aren't high prio if nobody looks at them, perhaps the soft failure should be changed to: label:wontfix:xxxx

Actions

Copy link

#53

Updated by llzhao over 1 year ago

Status changed from Feedback to Workable

Reopen it as there are some occurrences in OSD:
https://openqa.suse.de/tests/13380296#step/iscsi_client/9
https://openqa.suse.de/tests/13380301#step/iscsi_client/9

Actions

Copy link

#54

Updated by acarvajal over 1 year ago

llzhao wrote in #note-53:

Reopen it as there are some occurrences in OSD:
https://openqa.suse.de/tests/13380296#step/iscsi_client/9
https://openqa.suse.de/tests/13380301#step/iscsi_client/9

We're observing this only in ppc64le and only in SLES for SAP jobs. HA jobs in ppc64le do not have the issue, so it could be possibly related to qemu_ppc64le-large-mem workers.

Actions

Copy link

#55

Updated by acarvajal over 1 year ago

https://openqa.suse.de/tests/13381522#step/iscsi_client/9
https://openqa.suse.de/tests/13381519#step/iscsi_client/9

Seems cluster nodes ran in petrol and support servers ran in mania ... and the error is resolving openqa.suse.de. Could be a MM connection issue between cluster nodes and support server.

Actions

Copy link

#56

Updated by acarvajal over 1 year ago

Related to action #154552: [ppc64le] test fails in iscsi_client - zypper reports Error Message: Could not resolve host: openqa.suse.de added

Actions

Copy link

#57

Updated by acarvajal over 1 year ago · Edited

Status changed from Workable to Feedback

Closing this again, as Tools Teams thinks it's a new issue. Instead filed: #154552

Actions

Copy link

#58

Updated by okurz over 1 year ago

"Feedback" is not closed. The ticket was open since openqa_review opened it over a year ago in #95788-51 and it's in the scope of qe-sap as visible in the subject.

Actions

Copy link

#59

Updated by slo-gin 4 months ago

This ticket was set to Normal priority but was not updated within the SLO period. Please consider picking up this ticket or just set the ticket to the next lower priority.

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public) » openQA Tests (public)

Tags

Custom queries

action #95788

[qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network

Observation¶

Reproducible¶

Expected result¶

Further details¶

Suggestions¶

Updated by okurz almost 4 years ago

Updated by okurz almost 4 years ago

Updated by okurz almost 4 years ago

Updated by okurz almost 4 years ago

Updated by okurz almost 4 years ago

Updated by okurz almost 4 years ago

Updated by okurz almost 4 years ago

Updated by okurz almost 4 years ago

Updated by acarvajal almost 4 years ago

Updated by okurz almost 4 years ago

Updated by openqa_review almost 4 years ago

Updated by acarvajal almost 4 years ago

Updated by acarvajal almost 4 years ago

Updated by openqa_review over 3 years ago

Updated by okurz over 3 years ago

Updated by openqa_review over 3 years ago

Updated by openqa_review over 3 years ago

Updated by okurz over 3 years ago

Updated by okurz over 3 years ago

Updated by livdywan over 3 years ago

Updated by livdywan over 3 years ago

Observation¶

Reproducible¶

Further details¶

Updated by acarvajal over 3 years ago

Updated by okurz over 3 years ago

Updated by livdywan over 3 years ago

Updated by okurz over 3 years ago

Updated by livdywan over 3 years ago

Updated by livdywan over 3 years ago

Updated by acarvajal over 3 years ago

Updated by livdywan over 3 years ago

Updated by okurz over 3 years ago

Updated by livdywan over 3 years ago

Updated by okurz over 3 years ago

Updated by okurz over 3 years ago

Updated by openqa_review over 3 years ago

Updated by openqa_review over 3 years ago

Updated by openqa_review over 3 years ago

Updated by asmorodskyi over 3 years ago

Updated by okurz over 3 years ago

Updated by openqa_review over 3 years ago

Updated by openqa_review over 3 years ago

Updated by openqa_review over 3 years ago

Updated by asmorodskyi over 3 years ago

Updated by openqa_review over 3 years ago

Updated by rbranco about 3 years ago

Updated by okurz about 3 years ago

Updated by openqa_review about 3 years ago

Updated by openqa_review about 3 years ago

Updated by openqa_review about 3 years ago

Updated by openqa_review almost 3 years ago

Updated by openqa_review almost 3 years ago

Updated by szarate almost 3 years ago

Updated by llzhao over 1 year ago

Updated by acarvajal over 1 year ago

Updated by acarvajal over 1 year ago

Updated by acarvajal over 1 year ago

Updated by acarvajal over 1 year ago · Edited

Updated by okurz over 1 year ago

Updated by slo-gin 4 months ago