coordination #96185

[epic] Multimachine failure rate increased

Added by dzedro 3 months ago. Updated 21 days ago.

Status:
Blocked
Priority:
Normal
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2021-07-29
Due date:
2021-10-09
% Done:

50%

Estimated time:
(Total: 0.00 h)
Difficulty:

Description

Observation

There are far more MM failures than, I would guess, 1-2 weeks ago.
In this case I am generally speaking about two-node MM jobs.

There is also a category of three-node MM jobs which has been failing very often for as long as I can remember; for Maintenance these are the HA/SAP jobs.
As an example I randomly picked the "always" failing https://openqa.suse.de/tests/6589394#next_previous
On my instance without remote workers the test passed on the first and second run: http://dzedro.suse.cz/tests/18735
The same 3-node MM test fails 100% of the time on osd and 0% on a small instance without remote workers.
The first two examples are wicked tests, and on one of them ping is failing with 50%+ packet loss, which to me looks more like a network issue.

I don't know whether something changed in the setup or in the network; either could have a problem.
The network issue could be related to https://progress.opensuse.org/issues/95299
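The 50%+ packet loss mentioned above can be quantified outside openQA by running `ping -c 20 <peer>` between the two nodes and reading the summary line; a minimal parsing sketch (the sample output string is invented for illustration):

```python
import re

def packet_loss(ping_output: str) -> float:
    """Extract the packet-loss percentage from ping's summary line."""
    m = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_output)
    if m is None:
        raise ValueError("no packet loss summary found")
    return float(m.group(1))

# Sample summary line as printed by iputils ping; on a worker you would
# feed in the real output of `ping -c 20 <peer>` instead.
sample = "20 packets transmitted, 9 received, 55% packet loss, time 19034ms"
print(packet_loss(sample))
```

Running this periodically on both nodes while a cluster job is active would show whether the loss correlates with specific workers or times.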

2 node, some of today's failures
https://openqa.suse.de/tests/6588464#step/t05_dynamic_addresses_xml/260
https://openqa.suse.de/tests/6588818#step/t04_bonding_broadcast/11
https://openqa.suse.de/tests/6587990#step/iscsi_client/44
https://openqa.suse.de/tests/6588107#step/boot_to_desktop/10
https://openqa.suse.de/tests/6588108#step/2_sw_multipath_s_aa/1
https://openqa.suse.de/tests/6588254#step/installation/21
https://openqa.suse.de/tests/6588391#step/scc_registration/32
https://openqa.suse.de/tests/6588713#step/welcome/11
https://openqa.suse.de/tests/6590916#step/await_install/68

3 node, one example, but there are tens of HA/SAP failures every day
https://openqa.suse.de/tests/6591092#step/register_without_ltss/9


Subtasks

action #96260: Failed to add GRE tunnel to openqaworker10 on most OSD workers, recent regression explaining multi-machine errors? size:M (Resolved, assignee: dheidler)

action #99135: Provide ratio of tests by result in monitoring - by worker (Resolved, assignee: okurz)

action #99138: Provide ratio of tests by result in monitoring - by job group (New)

action #99141: Provide ratio of tests by result in monitoring - by machine (New)


Related issues

Related to openQA Project - action #96191: Provide "fail-rate" of tests, especially multi-machine, in grafana size:M (Resolved, 2021-07-28 to 2021-09-29)

Related to openQA Project - action #95299: Tests timeout with reason 'setup exceeded MAX_SETUP_TIME' on osd ppc64le workers auto_review:"Result: timeout":retry size:M (Resolved, 2021-07-09)

Related to openQA Tests - action #95824: [qe-sap][ha][shap] test fails in register_system - unable to download license, likely network configuration problem in multi-machine cluster? (Rejected, 2021-07-22)

Related to openQA Tests - action #95801: [qe-sap][ha][css][shap] test fails in register_system of multi-machine HA tests, failing to access network (Rejected, 2021-07-21)

Related to openQA Tests - action #95788: [tools][qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)":retry (Workable, 2021-07-21)

History

#1 Updated by dzedro 3 months ago

  • Description updated (diff)

#2 Updated by okurz 3 months ago

  • Related to action #96191: Provide "fail-rate" of tests, especially multi-machine, in grafana size:M added

#3 Updated by okurz 3 months ago

  • Related to action #95299: Tests timeout with reason 'setup exceeded MAX_SETUP_TIME' on osd ppc64le workers auto_review:"Result: timeout":retry size:M added

#4 Updated by okurz 3 months ago

  • Related to action #95824: [qe-sap][ha][shap] test fails in register_system - unable to download license, likely network configuration problem in multi-machine cluster? added

#5 Updated by okurz 3 months ago

  • Related to action #95801: [qe-sap][ha][css][shap] test fails in register_system of multi-machine HA tests, failing to access network added

#6 Updated by okurz 3 months ago

  • Related to action #95788: [tools][qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network auto_review:"(?s)tests/ha.*(post_fail_hook failed: command.*curl|command.+ping.+node0.+failed)":retry added

#7 Updated by okurz 3 months ago

  • Project changed from QA to openQA Project
  • Description updated (diff)
  • Category set to Concrete Bugs
  • Priority changed from Normal to High
  • Target version set to Ready

Thanks for your ticket. Just yesterday I came up with #96191, which is related as well. Last week I already found multiple network-related problems in multi-machine tests; I linked these as related.
However, the different kinds of multi-machine tests vary quite a lot, and it's again an area that SUSE QE Tools team members don't have much experience with. So I don't see that we can offer much help from the SUSE QE Tools side. Basically I would hope that the multi-machine test experts would get together and look into the problem jointly.

EDIT: I asked for help in https://chat.suse.de/channel/testing?msg=F6s78REbS5XRXjpQ3

#8 Updated by asmorodskyi 3 months ago

I know that it does not help much, but I would remove this job https://openqa.suse.de/tests/6588107#step/boot_to_desktop/10 from the description. A boot_to_desktop failure is hardly likely to be related to MM infra problems.

#9 Updated by asmorodskyi 3 months ago

These two also look unrelated to MM:
https://openqa.suse.de/tests/6588108#step/2_sw_multipath_s_aa/1
https://openqa.suse.de/tests/6588254#step/installation/18

I think it is important to solve one problem at a time.

#10 Updated by okurz 3 months ago

  • Tracker changed from action to coordination
  • Subject changed from Multimachine failure rate increased to [epic] Multimachine failure rate increased
  • Status changed from New to Blocked
  • Assignee set to okurz

Agreed. Making this an epic.

asmorodskyi helped to identify one issue about a failed GRE tunnel creation: #96260

I would be happy to see more feedback and problem analysis but at least we should rule out that #96260 is the cause of many or most of the failed examples.

#11 Updated by dzedro 3 months ago

asmorodskyi wrote:

I know that it does not help much, but I would remove this job https://openqa.suse.de/tests/6588107#step/boot_to_desktop/10 from the description. A boot_to_desktop failure is hardly likely to be related to MM infra problems.

No, it is an MM test: one node prepares PXE and the second node boots from it, so it can easily be an MM or network problem.

#12 Updated by dzedro 3 months ago

asmorodskyi wrote:

These two also look unrelated to MM:
https://openqa.suse.de/tests/6588108#step/2_sw_multipath_s_aa/1
https://openqa.suse.de/tests/6588254#step/installation/18

I think it is important to solve one problem at a time.

The first one could be anything; the second one is not MM, though it could be related to network issues. They can of course be removed.
Adding new MM failures every day is no problem at all, unfortunately.
These were just examples of MM or network failures. Only the MM failures can be used, but when single-job tests are failing due to the network, then MM tests can fail due to the network, due to MM itself, or both.

#13 Updated by dzedro 3 months ago

Yesterday evening I restarted two MM clusters; node1 & node2 of this cluster ran on openqaworker10. I would not read too much into that detail, I picked these tests because they always fail. I will also restart another cluster running on a random worker and collect tcpdump. Here are tcpdumps and some info from the workers when the failures happened: ftp://10.100.12.155/MM/
I did a manual retry of qam_ha_rolling_upgrade_migration. The reason the test was failing in multiple places is a failure in name resolution; this also happened during SCC registration. Unfortunately the video is not there to see it, but I had to go into the network setup and make the installation redo the DHCP setup, sometimes multiple times. Generally MM/network/DNS was unable to resolve addresses like scc.suse.com or updates.suse.com. https://openqa.suse.de/tests/6628646#step/register_without_ltss/10 passed when I retried it.
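A quick way to separate the DNS symptom described above from a general connectivity problem is to probe name resolution directly on a cluster node; a minimal sketch (the host list is taken from the comment above, and a node is assumed to have Python available):

```python
import socket

def resolves(hostname: str) -> bool:
    """Return True if this host can resolve `hostname` to an address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

# Hosts the failing tests could not resolve; run this on a node while a
# cluster job is active to catch intermittent DNS failures.
for host in ("scc.suse.com", "updates.suse.com"):
    print(host, resolves(host))
```

Looping this during a test run would show whether resolution drops out intermittently, which would match the "random connection drops" observation rather than a permanently broken tunnel.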

qam_alpha_cluster
https://openqa.suse.de/tests/6628644
https://openqa.suse.de/tests/6628643
https://openqa.suse.de/tests/6628642

qam_ha_rolling_upgrade_migration
https://openqa.suse.de/tests/6628647
https://openqa.suse.de/tests/6628646
https://openqa.suse.de/tests/6628645

#14 Updated by okurz 3 months ago

Sorry, I did not understand your last comment. Did you want to report what you did, or what you plan to do? And do you think it would help to exclude openqaworker10 from all tests (not only multi-machine) and resolve #96260 first?

#15 Updated by okurz 3 months ago

I tried to find the "fail ratio per worker" with SQL but so far have not found a good approach. Maybe something along the lines of:

select count(jobs.id), workers.host
  from jobs
  left join workers on jobs.assigned_worker_id = workers.id
  join (select count(j.id) as failed_jobs_count, host
          from jobs j
          join workers w on j.assigned_worker_id = w.id
         group by w.host) failed_jobs
    on workers.host = failed_jobs.host
 group by workers.host
 order by count desc;

might work. I guess I need to either conduct a subquery as join or the other way around.
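The subquery-as-join can be avoided entirely with a conditional aggregate, which gives failed count, total count, and fail ratio per host in one pass. A sketch against a toy in-memory SQLite copy of the schema (table and column names taken from the queries in this comment; the data is invented):

```python
import sqlite3

# Minimal stand-in for the relevant openQA tables.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE workers (id INTEGER PRIMARY KEY, host TEXT);
CREATE TABLE jobs (id INTEGER PRIMARY KEY, assigned_worker_id INTEGER, result TEXT);
INSERT INTO workers VALUES (1, 'openqaworker10'), (2, 'openqaworker5');
INSERT INTO jobs VALUES
  (1, 1, 'failed'), (2, 1, 'passed'), (3, 1, 'passed'), (4, 1, 'failed'),
  (5, 2, 'passed'), (6, 2, 'failed'), (7, 2, 'passed'), (8, 2, 'passed');
""")

# One query: SUM over a 0/1 flag counts failures, AVG over the same flag
# is the fail ratio directly.
rows = db.execute("""
SELECT w.host,
       SUM(CASE WHEN j.result = 'failed' THEN 1 ELSE 0 END) AS failed,
       COUNT(j.id) AS total,
       AVG(CASE WHEN j.result = 'failed' THEN 1.0 ELSE 0.0 END) AS fail_ratio
  FROM jobs j
  JOIN workers w ON j.assigned_worker_id = w.id
 GROUP BY w.host
 ORDER BY fail_ratio DESC
""").fetchall()

for host, failed, total, ratio in rows:
    print(f"{host}: {failed}/{total} = {ratio:.2%}")
```

The same `SELECT` should work verbatim against the real PostgreSQL database (PostgreSQL also accepts `AVG((result = 'failed')::int)` as a shorter form).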

Now did it semi-automatic. All failed per worker host:

select count(jobs.id),host from jobs left join workers on jobs.assigned_worker_id = workers.id where result = 'failed' group by host order by count desc;
 count |        host         
-------+---------------------
 13660 | 
 11749 | openqaworker5
 11173 | grenache-1
  8573 | openqaworker2
  7271 | openqaworker6
  5727 | openqaworker9
  5405 | openqaworker13
  5349 | openqaworker8
  4869 | openqaworker-arm-2
  4619 | openqaworker3
  3200 | openqaworker10
  3016 | openqaworker-arm-1
  3005 | openqaworker-arm-3
  2953 | QA-Power8-4-kvm
  2415 | QA-Power8-5-kvm
  2187 | powerqaworker-qam-1
  1304 | malbec
   360 | automotive-3
(18 rows)

and all

select count(jobs.id),host from jobs left join workers on jobs.assigned_worker_id = workers.id group by host order by count desc;                                                                                                                                  
 count  |        host         
--------+---------------------
 182809 | 
  76185 | openqaworker5
  58381 | openqaworker6
  44507 | openqaworker9
  41977 | openqaworker8
  38853 | grenache-1
  38732 | openqaworker-arm-2
  37838 | openqaworker3
  32992 | openqaworker13
  28221 | openqaworker-arm-3
  27405 | openqaworker2
  21533 | openqaworker10
  20928 | openqaworker-arm-1
  18804 | QA-Power8-5-kvm
  18454 | QA-Power8-4-kvm
  16240 | powerqaworker-qam-1
   9627 | malbec
   1397 | automotive-3
    287 | openqaworker11
(19 rows)

and all failed jobs regardless of worker host:

select count(jobs.id) from jobs where result = 'failed';
 count 
-------
 96835
(1 row)

and all jobs regardless of worker host:

select count(jobs.id) from jobs;                                                                                                                                                                                                                                   
 count  
--------
 715170
(1 row)

So the total fail ratio is 96835/715170 = 13.54%, and the openqaworker10-specific fail ratio is 3200/21533 = 14.86%. An arbitrary other example is openqaworker5 with 15.42%; for grenache-1 it is 28.76% (!). In conclusion, openqaworker10 does not show a significantly different fail ratio itself, hence I don't think we need to exclude it.
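The ratios in this comment follow directly from the query output above; a quick reproduction (all numbers copied from the counts in this comment):

```python
# Totals from the two overall queries.
failed_total, jobs_total = 96835, 715170

# (failed, total) per worker host, from the two per-host queries.
per_host = {
    "openqaworker10": (3200, 21533),
    "openqaworker5": (11749, 76185),
    "grenache-1": (11173, 38853),
}

print(f"overall: {failed_total / jobs_total:.2%}")
for host, (failed, total) in per_host.items():
    print(f"{host}: {failed / total:.2%}")
```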

#16 Updated by dzedro 3 months ago

okurz wrote:

Sorry, I did not understand your last comment. Did you want to report about what you did or what you open to do? So do you think it would help to exclude openqaworker10 from all tests (not only multi-machine) and resolve #96260 first?

I reported what I did. I don't think that only openqaworker10 has a problem. I am not sure how #96260 affects this issue: the GRE tunnel is either added or not, and if not, there is no connection between the workers, but the failures I see look more like random connection drops.

#17 Updated by okurz about 2 months ago

With #96260 resolved, I have now added #96191 to the backlog.

#18 Updated by dzedro about 1 month ago

I tried to install and use ovs-test to debug Open vSwitch,
but it looks like the openvswitch-test package is broken:

# ovs-test -h
  File "/usr/bin/ovs-test", line 45
    print "Node %s:%u " % (node[0], node[1])
                      ^
SyntaxError: invalid syntax
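The traceback shows Python 2 `print`-statement syntax being rejected by a Python 3 interpreter, i.e. the packaged script was never ported to Python 3. A minimal sketch of the difference (the node tuple is invented just to make the line runnable):

```python
# Hypothetical node address for illustration.
node = ("10.0.2.15", 5201)

# Python 2, as shipped in /usr/bin/ovs-test (a statement, invalid in Python 3):
#     print "Node %s:%u " % (node[0], node[1])

# Python 3 equivalent: print is a function call.
msg = "Node %s:%u " % (node[0], node[1])
print(msg)
```

So the package bug is likely just that `/usr/bin/ovs-test` is Python 2 code invoked with a Python 3 interpreter; running it under Python 2 (if still installed) or reporting it against the openvswitch-test package would be the next step.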

There are also ovs-tcpdump and other ovs-* tools, but I didn't check whether they work.
MM jobs are no longer failing as frequently as when the ticket was created, but there are still MM failures. Maybe it's the network, I don't know.
