coordination #96185

closed

[epic] Multimachine failure rate increased

Added by dzedro almost 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2021-07-29
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

Observation

There are many more multi-machine (MM) failures than there were, I guess, 1-2 weeks ago.
Here I am mostly talking about two-node MM jobs.

There is also a category of three-node MM jobs that has always failed very often; for Maintenance these are the HA/SAP jobs.
As an example I randomly picked an "always" failing job: https://openqa.suse.de/tests/6589394#next_previous
On my instance without remote workers the test passed on both the first and the second run: http://dzedro.suse.cz/tests/18735
So the same 3-node MM test fails 100% of the time on osd and 0% of the time on a small instance without remote workers.
The first two 2-node examples below are wicked tests, and in one of them ping fails with 50%+ packet loss, which to me looks more like a network issue.
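
For triaging failures like the ping one, a minimal sketch of how the packet loss could be pulled out of captured ping output. This is only an illustration, not something the tests currently do: the function names and the 50% threshold are assumptions, and the regular expression assumes the usual "N% packet loss" summary line.

```python
import re
from typing import Optional

# Matches the summary line printed by common ping implementations, e.g.
# "4 packets transmitted, 2 received, 50% packet loss, time 3004ms".
LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def packet_loss_percent(ping_output: str) -> Optional[float]:
    """Return the packet loss percentage, or None if no summary line is found."""
    match = LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


def looks_like_network_issue(ping_output: str, threshold: float = 50.0) -> bool:
    """Hypothetical helper: flag output whose loss is at or above the threshold."""
    loss = packet_loss_percent(ping_output)
    return loss is not None and loss >= threshold
```

A job whose serial log contains a summary with 50% or more loss would then be flagged as a likely network problem rather than a product bug.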

I don't know whether something changed in the setup or in the network; either could be the problem.
The network issue could be related to #95299.

2-node, some of today's failures:
https://openqa.suse.de/tests/6588464#step/t05_dynamic_addresses_xml/260
https://openqa.suse.de/tests/6588818#step/t04_bonding_broadcast/11
https://openqa.suse.de/tests/6587990#step/iscsi_client/44
https://openqa.suse.de/tests/6588107#step/boot_to_desktop/10
https://openqa.suse.de/tests/6588108#step/2_sw_multipath_s_aa/1
https://openqa.suse.de/tests/6588254#step/installation/21
https://openqa.suse.de/tests/6588391#step/scc_registration/32
https://openqa.suse.de/tests/6588713#step/welcome/11
https://openqa.suse.de/tests/6590916#step/await_install/68

3-node, one as an example, but there are tens of HA/SAP failures every day:
https://openqa.suse.de/tests/6591092#step/register_without_ltss/9
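
To compare instances by numbers rather than by impression, the fail rate could be computed per instance (or per worker) from a list of job results. A rough sketch, assuming the jobs are already available as (group, result) pairs; the function name, the sample data and the choice to count only "passed" and "softfailed" as successes are assumptions for illustration.

```python
from collections import Counter

# Assumption: only these results count as a success; everything else
# (failed, incomplete, parallel_failed, ...) is counted as a failure.
OK_RESULTS = {"passed", "softfailed"}


def fail_rate_by_group(jobs):
    """jobs: iterable of (group, result) pairs.

    Returns a dict mapping each group to its fraction of failed jobs.
    """
    totals = Counter()
    failures = Counter()
    for group, result in jobs:
        totals[group] += 1
        if result not in OK_RESULTS:
            failures[group] += 1
    return {group: failures[group] / totals[group] for group in totals}


# Made-up sample input, not real job data.
sample = [
    ("osd", "failed"),
    ("osd", "parallel_failed"),
    ("small-instance", "passed"),
    ("small-instance", "passed"),
]
print(fail_rate_by_group(sample))
```

Grouping by worker host instead of by instance would show whether the failures concentrate on particular workers.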


Subtasks 2 (0 open, 2 closed)

action #96260: Failed to add GRE tunnel to openqaworker10 on most OSD workers, recent regression explaining multi-machine errors? size:M (Resolved, dheidler, 2021-07-29)
action #99135: Provide ratio of tests by result in monitoring - by worker (Resolved, okurz)

Related issues 5 (1 open, 4 closed)

Related to openQA Project - action #96191: Provide "fail-rate" of tests, especially multi-machine, in grafana size:M (Resolved, okurz, 2021-07-28, 2021-09-29)
Related to openQA Project - action #95299: Tests timeout with reason 'setup exceeded MAX_SETUP_TIME' on osd ppc64le workers auto_review:"Result: timeout":retry size:M (Resolved, mkittler, 2021-07-09)
Related to openQA Tests - action #95824: [qe-sap][ha][shap] test fails in register_system - unable to download license, likely network configuration problem in multi-machine cluster? (Rejected, 2021-07-22)
Related to openQA Tests - action #95801: [qe-sap][ha][css][shap] test fails in register_system of multi-machine HA tests, failing to access network (Rejected, 2021-07-21)
Related to openQA Tests - action #95788: [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network (Feedback, 2021-07-21)