coordination #96185
[epic] Multimachine failure rate increased (closed)
Description
Observation
There are many more multi-machine (MM) failures than there were, I guess, 1-2 weeks ago.
Here I am generally talking about two-node MM jobs.
There is also a category of three-node MM jobs that has been failing very often for as long as I can remember; for Maintenance these are the HA/SAP jobs.
As an example I randomly picked the "always" failing https://openqa.suse.de/tests/6589394#next_previous
On my instance without remote workers the test passed on the first and the second run: http://dzedro.suse.cz/tests/18735
The same 3-node MM test fails 100% of the time on osd and 0% of the time on my small instance without remote workers.
The first two examples are wicked tests, and in one of them ping is failing with 50%+ packet loss, which to me looks more like a network issue.
I don't know whether something changed in the setup or in the network; either could be the problem.
The network issue could be related to #95299.
2-node, some of today's failures:
https://openqa.suse.de/tests/6588464#step/t05_dynamic_addresses_xml/260
https://openqa.suse.de/tests/6588818#step/t04_bonding_broadcast/11
https://openqa.suse.de/tests/6587990#step/iscsi_client/44
https://openqa.suse.de/tests/6588107#step/boot_to_desktop/10
https://openqa.suse.de/tests/6588108#step/2_sw_multipath_s_aa/1
https://openqa.suse.de/tests/6588254#step/installation/21
https://openqa.suse.de/tests/6588391#step/scc_registration/32
https://openqa.suse.de/tests/6588713#step/welcome/11
https://openqa.suse.de/tests/6590916#step/await_install/68
3-node, one as an example, but there are tens of HA/SAP failures every day:
https://openqa.suse.de/tests/6591092#step/register_without_ltss/9
Updated by okurz over 3 years ago
- Related to action #96191: Provide "fail-rate" of tests, especially multi-machine, in grafana size:M added
Updated by okurz over 3 years ago
- Related to action #95299: Tests timeout with reason 'setup exceeded MAX_SETUP_TIME' on osd ppc64le workers auto_review:"Result: timeout":retry size:M added
Updated by okurz over 3 years ago
- Related to action #95824: [qe-sap][ha][shap] test fails in register_system - unable to download license, likely network configuration problem in multi-machine cluster? added
Updated by okurz over 3 years ago
- Related to action #95801: [qe-sap][ha][css][shap] test fails in register_system of multi-machine HA tests, failing to access network added
Updated by okurz over 3 years ago
- Related to action #95788: [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network added
Updated by okurz over 3 years ago
- Project changed from QA (public) to openQA Project (public)
- Description updated (diff)
- Category set to Regressions/Crashes
- Priority changed from Normal to High
- Target version set to Ready
Thanks for your ticket. Just yesterday I created #96191, which is related as well. Last week I already found multiple network-related problems in multi-machine tests; I linked these as related.
However, the various kinds of multi-machine tests are all quite different, and this is again an area that SUSE QE Tools team members do not have much experience with, so I don't see that we can offer much help from the SUSE QE Tools side. Basically I would hope that the multi-machine test experts assemble and look into the problem together.
EDIT: I asked for help in https://chat.suse.de/channel/testing?msg=F6s78REbS5XRXjpQ3
Updated by asmorodskyi over 3 years ago
I know that it does not help much, but I would remove this job https://openqa.suse.de/tests/6588107#step/boot_to_desktop/10 from the description. A boot_to_desktop failure is hardly likely to be related to MM infra problems.
Updated by asmorodskyi over 3 years ago
These two also look unrelated to MM:
https://openqa.suse.de/tests/6588108#step/2_sw_multipath_s_aa/1
https://openqa.suse.de/tests/6588254#step/installation/18
I think it is important to solve one problem at a time.
Updated by okurz over 3 years ago
- Tracker changed from action to coordination
- Subject changed from Multimachine failure rate increased to [epic] Multimachine failure rate increased
- Status changed from New to Blocked
- Assignee set to okurz
Agreed. Making this an epic.
@asmorodskyi helped to identify one issue, a failed GRE tunnel creation: #96260
I would be happy to see more feedback and problem analysis, but at the very least we should rule out that #96260 is the cause of many or most of the failed examples.
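A quick way to rule #96260 in or out on a particular worker host could be to check whether the expected GRE ports are actually present on the Open vSwitch bridge, roughly like this (only a sketch; the bridge name br1 is an assumption based on the default openQA multi-machine setup):
# list bridges, ports and interfaces; GRE tunnels appear as interfaces with "type: gre"
ovs-vsctl show
# bridge name br1 is an assumption based on the default openQA multi-machine setup
ovs-vsctl list-ports br1
# show only the GRE interfaces and their remote_ip options
ovs-vsctl show | grep -B 2 -A 1 'type: gre'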
Updated by dzedro over 3 years ago
asmorodskyi wrote:
I know that it does not help much, but I would remove this job https://openqa.suse.de/tests/6588107#step/boot_to_desktop/10 from the description. A boot_to_desktop failure is hardly likely to be related to MM infra problems.
No, it is an MM test: one node prepares PXE and the second node boots from it, so it can easily be an MM or network problem.
Updated by dzedro over 3 years ago
asmorodskyi wrote:
These two also look unrelated to MM:
https://openqa.suse.de/tests/6588108#step/2_sw_multipath_s_aa/1
https://openqa.suse.de/tests/6588254#step/installation/18
I think it is important to solve one problem at a time
The first one could be anything; the second one is not MM, although it could be related to network issues. They can of course be removed.
Unfortunately, adding new MM failures every day is no problem at all.
These were just examples of MM or network failures. Only MM failures could be used, but when even single (non-MM) jobs are failing due to the network, then MM jobs can fail due to the network, due to MM itself, or both.
Updated by dzedro over 3 years ago
Yesterday evening I restarted two MM clusters; node1 & node2 of this cluster run on openqaworker10. I would not read too much into that detail, I picked these tests because they always fail. I will also restart another cluster running on a random worker and collect tcpdumps. Here are the tcpdumps and some info from the workers from when the failures happened: ftp://10.100.12.155/MM/ (a possible capture command is sketched after the job lists below).
I did a manual retry of qam_ha_rolling_upgrade_migration. The reason the test was failing in multiple places is a failure in name resolution; this also happened during SCC registration. Unfortunately there is no video to see it, but I had to go to the network setup and have the installation redo the DHCP setup, sometimes multiple times. Generally the MM/network/DNS was unable to resolve addresses like scc.suse.com or updates.suse.com. https://openqa.suse.de/tests/6628646#step/register_without_ltss/10 passed when I retried it.
qam_alpha_cluster
https://openqa.suse.de/tests/6628644
https://openqa.suse.de/tests/6628643
https://openqa.suse.de/tests/6628642
qam_ha_rolling_upgrade_migration
https://openqa.suse.de/tests/6628647
https://openqa.suse.de/tests/6628646
https://openqa.suse.de/tests/6628645
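For reference, captures like the ones on the FTP server above could be taken directly on a worker while the cluster runs, roughly like this (only a sketch; the uplink interface name eth0 and the bridge name br1 are assumptions, GRE is IP protocol 47):
# GRE-encapsulated inter-worker traffic on the physical uplink (interface name is an assumption)
tcpdump -n -i eth0 -w "gre-$(hostname).pcap" 'ip proto 47'
# traffic as seen on the multi-machine bridge itself (bridge name br1 is an assumption)
tcpdump -n -i br1 -w "br1-$(hostname).pcap"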
Updated by okurz over 3 years ago
Sorry, I did not understand your last comment. Did you want to report on what you did or on what you plan to do? And do you think it would help to exclude openqaworker10 from all tests (not only multi-machine) and resolve #96260 first?
Updated by okurz over 3 years ago
I tried to find the "fail ratio per worker" with SQL but so far have not found a good approach. Maybe something along the lines of:
select count(jobs.id), workers.host
  from jobs
  left join workers on jobs.assigned_worker_id = workers.id
  join (select count(j.id) as failed_jobs_count, host
          from jobs j
          join workers w on j.assigned_worker_id = w.id
         group by w.host) failed_jobs
    on workers.host = failed_jobs.host
 group by workers.host
 order by count desc;
might work. I guess I need to either turn the failed-jobs count into a subquery joined against the totals or the other way around; a possible combined query is sketched at the end of this comment.
For now I did it semi-automatically. All failed jobs per worker host:
select count(jobs.id),host from jobs left join workers on jobs.assigned_worker_id = workers.id where result = 'failed' group by host order by count desc;
count | host
-------+---------------------
13660 |
11749 | openqaworker5
11173 | grenache-1
8573 | openqaworker2
7271 | openqaworker6
5727 | openqaworker9
5405 | openqaworker13
5349 | openqaworker8
4869 | openqaworker-arm-2
4619 | openqaworker3
3200 | openqaworker10
3016 | openqaworker-arm-1
3005 | openqaworker-arm-3
2953 | QA-Power8-4-kvm
2415 | QA-Power8-5-kvm
2187 | powerqaworker-qam-1
1304 | malbec
360 | automotive-3
(18 rows)
and all jobs per worker host:
select count(jobs.id),host from jobs left join workers on jobs.assigned_worker_id = workers.id group by host order by count desc;
count | host
--------+---------------------
182809 |
76185 | openqaworker5
58381 | openqaworker6
44507 | openqaworker9
41977 | openqaworker8
38853 | grenache-1
38732 | openqaworker-arm-2
37838 | openqaworker3
32992 | openqaworker13
28221 | openqaworker-arm-3
27405 | openqaworker2
21533 | openqaworker10
20928 | openqaworker-arm-1
18804 | QA-Power8-5-kvm
18454 | QA-Power8-4-kvm
16240 | powerqaworker-qam-1
9627 | malbec
1397 | automotive-3
287 | openqaworker11
(19 rows)
and all failed jobs regardless of worker host:
select count(jobs.id) from jobs where result = 'failed';
count
-------
96835
(1 row)
and all jobs regardless of worker host:
select count(jobs.id) from jobs;
count
--------
715170
(1 row)
So the total fail ratio is 96835/715170 = 13.54%, and the openqaworker10-specific fail ratio is 3200/21533 = 14.86%. An arbitrary other example is openqaworker5 with 15.42%; grenache-1 is at 28.76% (!). In conclusion, openqaworker10 itself does not show a significantly different fail ratio, hence I don't think we need to exclude it.
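The per-host ratios could also be computed in one go, for example with a query along these lines run via psql on the database host (only a sketch; the database name "openqa" is an assumption, and the inner join leaves out jobs without an assigned worker):
psql openqa <<'SQL'
-- failed vs. total jobs and the resulting fail ratio per worker host
select w.host,
       count(*) filter (where j.result = 'failed') as failed,
       count(*) as total,
       round(100.0 * count(*) filter (where j.result = 'failed') / count(*), 2) as fail_pct
  from jobs j
  join workers w on j.assigned_worker_id = w.id
 group by w.host
 order by fail_pct desc;
SQL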
Updated by dzedro over 3 years ago
okurz wrote:
Sorry, I did not understand your last comment. Did you want to report on what you did or on what you plan to do? And do you think it would help to exclude openqaworker10 from all tests (not only multi-machine) and resolve #96260 first?
I reported what I did. I don't think that only openqaworker10 has a problem. I am not sure how #96260 affects this issue: the GRE tunnel is either added or not, and if not, there is no connection between the workers at all, but the failures I see look more like random connection drops.
Updated by dzedro over 3 years ago
I tried to install and use ovs-test to debug Open vSwitch, but it looks like the openvswitch-test package is broken:
# ovs-test -h
File "/usr/bin/ovs-test", line 45
print "Node %s:%u " % (node[0], node[1])
^
SyntaxError: invalid syntax
The error is Python 2 print syntax being run with Python 3. There are also ovs-tcpdump and other ovs-* tools, but I didn't check whether they work.
MM jobs are no longer failing as frequently as when the ticket was created, but there are still MM failures. Maybe it's the network, I don't know.
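If someone follows up on this, ovs-tcpdump might be the easier path; a minimal invocation would presumably look like the following (untested sketch; the bridge name br1 and port name tap0 are assumptions about the worker setup):
# list the ports on the multi-machine bridge to pick one worth capturing
ovs-vsctl list-ports br1
# mirror that port to a temporary interface and run tcpdump on it;
# the remaining arguments are passed through to tcpdump
ovs-tcpdump -i tap0 -n -w "ovs-tap0-$(hostname).pcap"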
Updated by okurz about 3 years ago
- Description updated (diff)
I have monitored this topic over the past months. I can see that the "wicked" tests are very stable, so I doubt there is a generic problem left in our backends or infrastructure. SAP-related tests are a different kind of problem, e.g. see #95458 and #95788.
Based on graphs like https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=27&orgId=1&from=now-30d&to=now I can see that some openQA worker hosts are more likely to produce problems than others, e.g. openqaworker-arm-4 and openqaworker-arm-5. Both of these are handled specifically in tickets like #101048, and they are not even enabled for multi-machine tests anyway. The next in line after that is openqaworker2, which runs many "exotic" machines like vmware, hyperv, IPMI and s390x, which could explain the higher fail ratio there. Over the period for which we have recorded the data in the mentioned graph (about a month) I only see a significant increase in the fail ratio for openqaworker-arm-4/5, which I already mentioned. For the other hosts it has stayed the same or gone down.
Looking at https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=24&from=1632016393108&to=1635345865912 I see no significant change over the reporting period of the last month. Most multi-machine jobs end up as "obsoleted"; the next biggest category is "passed".
Updated by okurz about 3 years ago
- Status changed from Blocked to Resolved
No response, so I am assuming the specific issue at hand is fixed. In any case we have better monitoring now, so we should have a better chance of detecting such issues in the near future. There are also other related tickets still open with more specific information; see the related tasks.
Updated by dzedro about 3 years ago
Sorry, I missed the comment. Yes, I agree that the MM failure rate is much better now.
Existing or new failures should be handled separately.