coordination #96185

[epic] Multimachine failure rate increased

Added by dzedro over 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2021-07-29
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

Observation

There are many more MM failures than, I'd estimate, 1-2 weeks ago.
In this case I'm generally speaking about two-node MM jobs.

There is also a category of three-node MM jobs which has been failing very often for as long as I can remember; for Maintenance these are the HA/SAP jobs.
As an example I randomly took the "always" failing https://openqa.suse.de/tests/6589394#next_previous
On my instance without remote workers the test passed on the first and second run: http://dzedro.suse.cz/tests/18735
The same 3-node MM test fails 100% of the time on osd and 0% on a small instance without remote workers.
The first two examples are wicked tests, and in one of them ping fails with 50%+ packet loss, which to me looks more like a network issue.

I don't know whether something changed in the setup or the network; either could have a problem.
The network issue could be related to #95299

2 node, some of today's failures
https://openqa.suse.de/tests/6588464#step/t05_dynamic_addresses_xml/260
https://openqa.suse.de/tests/6588818#step/t04_bonding_broadcast/11
https://openqa.suse.de/tests/6587990#step/iscsi_client/44
https://openqa.suse.de/tests/6588107#step/boot_to_desktop/10
https://openqa.suse.de/tests/6588108#step/2_sw_multipath_s_aa/1
https://openqa.suse.de/tests/6588254#step/installation/21
https://openqa.suse.de/tests/6588391#step/scc_registration/32
https://openqa.suse.de/tests/6588713#step/welcome/11
https://openqa.suse.de/tests/6590916#step/await_install/68

3 node, one as an example, but there are tens of HA/SAP failures every day
https://openqa.suse.de/tests/6591092#step/register_without_ltss/9


Subtasks 2 (0 open, 2 closed)

action #96260: Failed to add GRE tunnel to openqaworker10 on most OSD workers, recent regression explaining multi-machine errors? size:M | Resolved | dheidler | 2021-07-29

action #99135: Provide ratio of tests by result in monitoring - by worker | Resolved | okurz


Related issues 5 (1 open, 4 closed)

Related to openQA Project - action #96191: Provide "fail-rate" of tests, especially multi-machine, in grafana size:M | Resolved | okurz | 2021-07-28 | 2021-09-29

Related to openQA Project - action #95299: Tests timeout with reason 'setup exceeded MAX_SETUP_TIME' on osd ppc64le workers auto_review:"Result: timeout":retry size:M | Resolved | mkittler | 2021-07-09

Related to openQA Tests - action #95824: [qe-sap][ha][shap] test fails in register_system - unable to download license, likely network configuration problem in multi-machine cluster? | Rejected | 2021-07-22

Related to openQA Tests - action #95801: [qe-sap][ha][css][shap] test fails in register_system of multi-machine HA tests, failing to access network | Rejected | 2021-07-21

Related to openQA Tests - action #95788: [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network | Feedback | 2021-07-21

Actions #1

Updated by dzedro over 3 years ago

  • Description updated (diff)
Actions #2

Updated by okurz over 3 years ago

  • Related to action #96191: Provide "fail-rate" of tests, especially multi-machine, in grafana size:M added
Actions #3

Updated by okurz over 3 years ago

  • Related to action #95299: Tests timeout with reason 'setup exceeded MAX_SETUP_TIME' on osd ppc64le workers auto_review:"Result: timeout":retry size:M added
Actions #4

Updated by okurz over 3 years ago

  • Related to action #95824: [qe-sap][ha][shap] test fails in register_system - unable to download license, likely network configuration problem in multi-machine cluster? added
Actions #5

Updated by okurz over 3 years ago

  • Related to action #95801: [qe-sap][ha][css][shap] test fails in register_system of multi-machine HA tests, failing to access network added
Actions #6

Updated by okurz over 3 years ago

  • Related to action #95788: [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network added
Actions #7

Updated by okurz over 3 years ago

  • Project changed from QA to openQA Project
  • Description updated (diff)
  • Category set to Regressions/Crashes
  • Priority changed from Normal to High
  • Target version set to Ready

Thanks for your ticket. Just yesterday I came up with #96191 as related as well. Last week I already found multiple network-related problems in multi-machine tests. I linked these as related.
However, the various kinds of multi-machine tests differ quite a bit and this is again an area that SUSE QE Tools team members do not have much experience with, so I don't see that we can offer much help from the SUSE QE Tools side. Basically I would hope that the multi-machine test experts assemble and look into the problem together.

EDIT: I asked for help in https://chat.suse.de/channel/testing?msg=F6s78REbS5XRXjpQ3

Actions #8

Updated by asmorodskyi over 3 years ago

I know that it does not help much, but I would remove this job https://openqa.suse.de/tests/6588107#step/boot_to_desktop/10 from the description. A boot_to_desktop failure is hardly likely to be related to MM infra problems.

Actions #9

Updated by asmorodskyi over 3 years ago

These two also look unrelated to MM:
https://openqa.suse.de/tests/6588108#step/2_sw_multipath_s_aa/1
https://openqa.suse.de/tests/6588254#step/installation/18

I think it is important to solve one problem at a time.

Actions #10

Updated by okurz over 3 years ago

  • Tracker changed from action to coordination
  • Subject changed from Multimachine failure rate increased to [epic] Multimachine failure rate increased
  • Status changed from New to Blocked
  • Assignee set to okurz

Agreed. Making this an epic.

@asmorodskyi helped to identify one issue about a failed GRE tunnel creation: #96260

I would be happy to see more feedback and problem analysis but at least we should rule out that #96260 is the cause of many or most of the failed examples.

Actions #11

Updated by dzedro over 3 years ago

asmorodskyi wrote:

I know that it does not help much, but I would remove this job https://openqa.suse.de/tests/6588107#step/boot_to_desktop/10 from the description. A boot_to_desktop failure is hardly likely to be related to MM infra problems.

No, it is an MM test: one node prepares PXE and the second node boots from it, so it can easily be an MM or network problem.

Actions #12

Updated by dzedro over 3 years ago

asmorodskyi wrote:

These two also look unrelated to MM:
https://openqa.suse.de/tests/6588108#step/2_sw_multipath_s_aa/1
https://openqa.suse.de/tests/6588254#step/installation/18

I think it is important to solve one problem at a time.

The first one could be anything; the second one is not MM but could be related to network issues. They can of course be removed.
Adding new MM failures every day is no problem at all, unfortunately.
These were just examples of MM or network failures. Only MM failures can be used, but when single-job tests are failing due to the network, then MM tests can fail due to the network, due to MM, or both.

Actions #13

Updated by dzedro about 3 years ago

Yesterday evening I restarted two MM clusters; node1 & node2 of this cluster run on openqaworker10. I would not read too much into this detail, I picked these tests because they always fail. I will also restart another cluster running on a random worker and collect tcpdump. Here are tcpdumps and some info from the workers when the failures happened: ftp://10.100.12.155/MM/
I did a manual retry of qam_ha_rolling_upgrade_migration. The reason the test was failing in multiple places is a failure in name resolution; this failure also happened during SCC registration. Unfortunately the video is not there to see it, but I had to go to the network setup and make the installation redo the DHCP setup, sometimes multiple times. Generally the MM/network/DNS was unable to resolve addresses like scc.suse.com or updates.suse.com. https://openqa.suse.de/tests/6628646#step/register_without_ltss/10 passed when I retried it.

qam_alpha_cluster
https://openqa.suse.de/tests/6628644
https://openqa.suse.de/tests/6628643
https://openqa.suse.de/tests/6628642

qam_ha_rolling_upgrade_migration
https://openqa.suse.de/tests/6628647
https://openqa.suse.de/tests/6628646
https://openqa.suse.de/tests/6628645

Actions #14

Updated by okurz about 3 years ago

Sorry, I did not understand your last comment. Did you want to report what you did, or what you plan to do? And do you think it would help to exclude openqaworker10 from all tests (not only multi-machine) and resolve #96260 first?

Actions #15

Updated by okurz about 3 years ago

I tried to find the "fail ratio per worker" with SQL but so far have not found a good approach. Maybe something along the lines of:

select count(jobs.id),workers.host from jobs left join workers on jobs.assigned_worker_id = workers.id join (select count(j.id) as failed_jobs_count,host from jobs j join workers w on j.assigned_worker_id = w.id group by w.host) failed_jobs on workers.host = failed_jobs.host group by workers.host order by count desc;

might work. I guess I need to either construct the subquery as a join or do it the other way around.
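
For reference, a single-query sketch for the fail ratio per worker host, assuming the same jobs/workers schema as in the query above and PostgreSQL's aggregate FILTER clause (untested against osd):

-- counts all jobs and, separately, jobs with result 'failed' per worker host,
-- then derives a percentage; column/table names taken from the queries in this comment
select w.host,
       count(j.id) as total_jobs,
       count(j.id) filter (where j.result = 'failed') as failed_jobs,
       round(100.0 * count(j.id) filter (where j.result = 'failed') / count(j.id), 2) as fail_ratio_percent
from jobs j
left join workers w on j.assigned_worker_id = w.id
group by w.host
order by fail_ratio_percent desc;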

Now I did it semi-automatically instead. All failed jobs per worker host:

select count(jobs.id),host from jobs left join workers on jobs.assigned_worker_id = workers.id where result = 'failed' group by host order by count desc;
 count |        host         
-------+---------------------
 13660 | 
 11749 | openqaworker5
 11173 | grenache-1
  8573 | openqaworker2
  7271 | openqaworker6
  5727 | openqaworker9
  5405 | openqaworker13
  5349 | openqaworker8
  4869 | openqaworker-arm-2
  4619 | openqaworker3
  3200 | openqaworker10
  3016 | openqaworker-arm-1
  3005 | openqaworker-arm-3
  2953 | QA-Power8-4-kvm
  2415 | QA-Power8-5-kvm
  2187 | powerqaworker-qam-1
  1304 | malbec
   360 | automotive-3
(18 rows)

and all jobs per worker host:

select count(jobs.id),host from jobs left join workers on jobs.assigned_worker_id = workers.id group by host order by count desc;                                                                                                                                  
 count  |        host         
--------+---------------------
 182809 | 
  76185 | openqaworker5
  58381 | openqaworker6
  44507 | openqaworker9
  41977 | openqaworker8
  38853 | grenache-1
  38732 | openqaworker-arm-2
  37838 | openqaworker3
  32992 | openqaworker13
  28221 | openqaworker-arm-3
  27405 | openqaworker2
  21533 | openqaworker10
  20928 | openqaworker-arm-1
  18804 | QA-Power8-5-kvm
  18454 | QA-Power8-4-kvm
  16240 | powerqaworker-qam-1
   9627 | malbec
   1397 | automotive-3
    287 | openqaworker11
(19 rows)

and all failed jobs regardless of worker host:

select count(jobs.id) from jobs where result = 'failed';
 count 
-------
 96835
(1 row)

and all jobs regardless of worker host:

select count(jobs.id) from jobs;                                                                                                                                                                                                                                   
 count  
--------
 715170
(1 row)

so the total fail ratio is 96835/715170 = 13.54%, while the openqaworker10-specific fail ratio is 3200/21533 = 14.86%. An arbitrary other example is openqaworker5 with 15.42%; grenache-1 is at 28.76% (!). So, in conclusion, openqaworker10 does not show a significantly different fail ratio itself, hence I don't think we need to exclude it.

Actions #16

Updated by dzedro about 3 years ago

okurz wrote:

Sorry, I did not understand your last comment. Did you want to report about what you did or what you open to do? So do you think it would help to exclude openqaworker10 from all tests (not only multi-machine) and resolve #96260 first?

I reported what I did. I don't think that only openqaworker10 has a problem. I'm not sure how #96260 affects this issue: a GRE tunnel is either added or not, and if it is not there is no connection between the workers, but the failures I see look more like random connection drops.

Actions #17

Updated by okurz about 3 years ago

With #96260 resolved I have now added #96191 to the backlog.

Actions #18

Updated by dzedro about 3 years ago

I tried to install/use ovs-test to debug Open vSwitch.
But it looks like the package openvswitch-test is broken (the script still uses Python 2 print syntax):

# ovs-test -h
  File "/usr/bin/ovs-test", line 45
    print "Node %s:%u " % (node[0], node[1])
                      ^
SyntaxError: invalid syntax

There are also ovs-tcpdump and other ovs-* tools, but I didn't try whether they work.
MM jobs are not failing as frequently anymore as they were when the ticket was created, but there are still MM failures. Maybe it's the network, I don't know.

Actions #19

Updated by okurz about 3 years ago

  • Description updated (diff)

I have monitored this topic over the past months. I can see that the "wicked" tests are very stable, so I doubt there is a generic problem left with our backends or infrastructure. SAP-related tests are a different kind of problem, e.g. see #95458 and #95788.

Based on graphs like https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=27&orgId=1&from=now-30d&to=now I can see that some openQA worker hosts are more likely to produce problems than others, e.g. openqaworker-arm-4 and openqaworker-arm-5. Both of these are specifically handled in tickets like #101048, and they are not even enabled for multi-machine tests anyway. The next in line after that is openqaworker2, which runs many "exotic" machines like vmware, hyperv, IPMI and s390x, so that could explain the higher fail ratio there. Over the period for which we have recorded the data in the mentioned graph (about a month) I only see a significant increase in fail ratio for openqaworker-arm-4/5, which I mentioned already. For the other hosts it stays the same or has decreased.

Looking at https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=24&from=1632016393108&to=1635345865912 I can see no significant change over the reporting period of the last month. Most multi-machine tests are obsoleted; the next biggest section is "passed".

Actions #20

Updated by okurz about 3 years ago

  • Description updated (diff)

@dzedro do you agree that the situation has improved again, or are you aware of still-problematic areas besides the known SAP test scenarios?

Actions #21

Updated by okurz almost 3 years ago

  • Status changed from Blocked to Resolved

No response, so I am assuming the specific issue at hand is fixed. In any case we have better monitoring now, so we should have a better chance of detecting such issues in the near future. Also, there are currently other related tickets still open with more specific information; see the related tasks.

Actions #22

Updated by dzedro almost 3 years ago

Sorry, I missed the comment. Yes, I agree that the MM failure rate is much better now.
Existing or new failures should be handled separately.
