action #151310
coordination #112862 (closed): [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
[regression] significant increase of parallel_failed+failed since 2023-11-21 size:M
Description
Motivation
As visible on https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1700508932604&to=1700724085546&viewPanel=24
since about 2023-11-21 there is again a significant increase in failed multi-machine tests, which should be investigated, mitigated, fixed and prevented.
Acceptance criteria
- AC1: failed+parallel_failed on https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=24 is significantly below 20% again
Suggestions
- Start to look into the issue early as waiting longer makes everything harder for us :)
- Look up common failure sources and check whether they are actually test or product regressions.
- Ask common stakeholders and/or test reviewers if they know something
- Review recent infrastructure changes which might be related
- Mitigate, fix and prevent the issues you find
- Consider using the scientific method https://progress.opensuse.org/projects/openqav3/wiki/#Further-decision-steps-working-on-test-issues
- Use SQL queries to find out what failures are most common
- Consider using this opportunity to document one or two examples of how we commonly do that
Updated by mkittler 10 months ago
I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/687 to disable MM tests on worker35 and 36 as these are the most problematic (see query above). Also, after shutting down these machines yesterday "all jobs run without problem except one" which is definitely better than it was with these machines running MM tests.
Updated by mkittler 10 months ago
For future reference: One of the problematic scenarios was https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-HA-Updates&machine=64bit&test=qam_alpha_cluster_01&version=15-SP2 - In the error case the hostname module already shows that there's no IP assigned and downloads fail later on (the hostname module itself doesn't fail; supposedly it would be good if this test would fail early).
Just for comparison:
- In passing runs an IP is assigned on eth0 and the mtu is 1458.
- In the failing runs no IP is assigned on eth0 and the mtu is 1500. The interface is shown as up at least.
Considering our discussion with Dirk, it is good that the MTU is lowered, and maybe the fact that it is not lowered when the test fails is the culprit. However, it could also just be a symptom. I'm also not sure where the MTU of 1458 that is shown in the good case comes from (it is supposedly either configured automatically or the test code somehow sets it).
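The comparison above could be turned into an early check so that a node without an IP fails right away instead of later during downloads. A minimal sketch, assuming the usual `ip addr show eth0` output format; the function name `check_interface`, the default expected MTU and both sample outputs are made up for illustration and are not taken from the test code:

```python
import re


def check_interface(ip_addr_output, expected_mtu=1458):
    """Return a list of problems found in `ip addr show <iface>` output."""
    problems = []
    mtu_match = re.search(r"\bmtu (\d+)", ip_addr_output)
    if mtu_match is None:
        problems.append("no MTU found")
    elif int(mtu_match.group(1)) != expected_mtu:
        problems.append(
            f"unexpected MTU {mtu_match.group(1)} (expected {expected_mtu})"
        )
    # "inet " only matches IPv4 lines; "inet6" lines are deliberately ignored
    if re.search(r"^\s*inet (\d+\.){3}\d+/", ip_addr_output, re.MULTILINE) is None:
        problems.append("no IPv4 address assigned")
    return problems


# Made-up sample outputs shaped like the passing and the failing case
GOOD = """2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1458 qdisc pfifo_fast state UP
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
"""
BAD = """2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::5054:ff:fe12:3456/64 scope link
"""

print(check_interface(GOOD))
print(check_interface(BAD))
```

An empty list means the interface looks like the passing case; anything else could be used as the reason to fail early.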
Updated by mkittler 10 months ago · Edited
The MTU is actually lowered manually by tests via https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/data/wicked/dhcp/dhcpd.conf. The configure_static_ip helper for MM tests defined in https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/lib/mm_network.pm does this as well. The support server setup also takes care of lowering the MTU in https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/support_server/setup.pm.
Now the question is whether any of this code is executed in the problematic test scenarios. If not I'm inclined to blame the tests.
EDIT: It looks like the support server in this scenario (qam_alpha_supportserver@64bit) would execute configure_static_ip, and then the IP and MTU also show up correctly (also when run as part of the failing cluster), e.g. https://openqa.suse.de/tests/12876164#step/setup/12 and https://openqa.suse.de/tests/12876164#step/setup/26. The other nodes don't seem to make an effort to lower the MTU. Maybe that's the missing bit that would help to stabilize these tests? It would still be strange that it is sometimes not necessary.
(In the problematic cluster the other node that ran on worker29 had a lowered MTU and an IP address assigned despite no explicit setup. Strange that on some nodes/workers it just works nevertheless. Supposedly this node really only parallel_failed because the other node on worker35 failed. This leaves one really wondering what the difference between "good" workers like 29 and "bad" workers like 35 and 36 is.)
Updated by nicksinger 10 months ago
Based on https://docs.openvswitch.org/en/latest/faq/issues/ ("Q: How can I configure the bridge internal interface MTU? Why does Open vSwitch keep changing internal ports MTU?") I just executed ovs-vsctl set int br1 mtu_request=1450 on workers 29, 35 and 39, where @mkittler has now triggered a new test cluster. If it does not work, it can be reverted with: ovs-vsctl set int br1 mtu_request=[]
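For applying and reverting the setting the same way on several hosts, the two commands above can be wrapped in a small helper. This is only a sketch: the function names `mtu_request_command` and `run_mtu_request` are hypothetical, and it uses nothing beyond the two ovs-vsctl invocations quoted in this comment. It defaults to a dry run because the real command needs root on the worker:

```python
import subprocess


def mtu_request_command(bridge, mtu=None):
    """Build the ovs-vsctl argv; mtu=None builds the revert command."""
    value = "[]" if mtu is None else str(int(mtu))
    return ["ovs-vsctl", "set", "int", bridge, f"mtu_request={value}"]


def run_mtu_request(bridge, mtu=None, dry_run=True):
    """Print the command in dry-run mode, otherwise execute it."""
    argv = mtu_request_command(bridge, mtu)
    if dry_run:
        print(" ".join(argv))
    else:
        subprocess.run(argv, check=True)
    return argv


run_mtu_request("br1", 1450)  # set the requested MTU
run_mtu_request("br1")        # revert to the default behaviour
```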
Updated by mkittler 10 months ago · Edited
Here's the restarted job cluster to see whether @nicksinger's change made a difference (scheduled so that all jobs run on the same nodes as before where previously the job on worker35 failed): https://openqa.suse.de/tests/12909195
2 more clusters to see whether results are consistent: https://openqa.suse.de/tests/12909617, https://openqa.suse.de/tests/12909619
Updated by mkittler 10 months ago
- Status changed from Workable to In Progress
All tests have passed now. Maybe it makes sense to apply this setting everywhere. We could then also try to re-enable w35 and w36.
It could also be just luck, though. Especially because the mtu visible within the VMs does not match 1450 (it is 1458). So I restarted the three clusters again.
Updated by openqa_review 10 months ago
- Due date set to 2023-12-12
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 10 months ago
making mtu size application persistent
- https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1055
- https://github.com/os-autoinst/os-autoinst/pull/2409
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1055 is now merged. Now we should monitor for the impact on tests
@mkittler I suggest you schedule some more multi-machine test clusters.
Updated by mkittler 10 months ago
We enabled the mtu_request setting on all workers via salt. Let's see what impact this has. If everything looks good we can merge https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/689 and monitor the fail ratio of MM jobs by hosts again to see whether the fail ratio of w35 and w36 has dropped.
Updated by mkittler 10 months ago
I created 50 more runs (https://openqa.suse.de/tests/overview?version=15-SP2&distri=sle&build=20231127-1-test-mm-1) to run on any host but only realized now that none of the jobs will run on w35 or w36 because we disabled tap there.
So I created 25 more runs to run across w35 and w36 (https://openqa.suse.de/tests/overview?build=20231127-1-test-mm-2&distri=sle&version=15-SP2). This way we can now at least compare the fail ratio between w35/w36 and all others.
Updated by mkittler 9 months ago
Here's the fail/incomplete ratio of jobs with parallel dependencies by worker as of yesterday:
openqa=> select distinct count(jobs.id) as total, sum(case when jobs.result in ('failed', 'incomplete') then 1 else 0 end) * 100. / count(jobs.id) as fail_rate_percent, host from jobs left join job_dependencies on (id = child_job_id or id = parent_job_id) join workers on jobs.assigned_worker_id = workers.id where dependency = 2 and t_started >= '2023-11-29' group by host having count(jobs.id) > 50 order by fail_rate_percent desc;
total | fail_rate_percent | host
-------+---------------------+-------------
51 | 37.2549019607843137 | worker-arm1
468 | 26.7094017094017094 | worker38
437 | 24.7139588100686499 | worker29
552 | 24.4565217391304348 | worker40
396 | 22.7272727272727273 | worker39
401 | 22.6932668329177057 | worker37
326 | 22.0858895705521472 | worker35
385 | 21.2987012987012987 | worker30
331 | 18.4290030211480363 | worker36
(9 rows)
So w35 and w36 are not on top (with a gap) anymore (worker-arm1 now has this role). That's good for w35/w36 so I'm keeping them enabled for MM jobs. Maybe changing the MTU setting really helped. I'll do this query again later/tomorrow.
Updated by mkittler 9 months ago
The fail ratio has now gone down a bit:
openqa=> select distinct count(jobs.id) as total, sum(case when jobs.result in ('failed', 'incomplete') then 1 else 0 end) * 100. / count(jobs.id) as fail_rate_percent, host from jobs left join job_dependencies on (id = child_job_id or id = parent_job_id) join workers on jobs.assigned_worker_id = workers.id where dependency = 2 and t_started >= '2023-11-29' group by host having count(jobs.id) > 50 order by fail_rate_percent desc;
total | fail_rate_percent | host
-------+---------------------+-------------
805 | 22.4844720496894410 | worker38
138 | 22.4637681159420290 | worker-arm2
737 | 19.9457259158751696 | worker30
772 | 19.8186528497409326 | worker29
691 | 19.3921852387843705 | worker39
150 | 19.3333333333333333 | worker-arm1
905 | 18.7845303867403315 | worker40
644 | 18.0124223602484472 | worker37
642 | 17.7570093457943925 | worker35
608 | 15.7894736842105263 | worker36
(10 rows)
w35 and w36 are now even both at the bottom.
Updated by mkittler 9 months ago · Edited
Not much has changed except that openqaworker18.qa.suse.cz is now running MM jobs and failing a lot with them:
openqa=> select distinct count(jobs.id) as total, sum(case when jobs.result in ('failed', 'incomplete') then 1 else 0 end) * 100. / count(jobs.id) as fail_rate_percent, host from jobs left join job_dependencies on (id = child_job_id or id = parent_job_id) join workers on jobs.assigned_worker_id = workers.id where dependency = 2 and t_started >= '2023-11-29' group by host having count(jobs.id) > 50 order by fail_rate_percent desc;
total | fail_rate_percent | host
-------+---------------------+----------------
80 | 25.0000000000000000 | openqaworker18
1162 | 19.4492254733218589 | worker38
822 | 18.1265206812652068 | worker-arm1
997 | 17.7532597793380140 | worker39
1173 | 17.3060528559249787 | worker29
1276 | 16.9278996865203762 | worker40
742 | 16.4420485175202156 | worker-arm2
1069 | 16.1833489242282507 | worker30
254 | 16.1417322834645669 | mania
909 | 15.8415841584158416 | worker37
964 | 15.7676348547717842 | worker35
936 | 15.0641025641025641 | worker36
EDIT: Looks like openqaworker18.qa.suse.cz has over the last days just executed enough jobs to be on the list at all. I'm deliberately filtering out all workers that have run fewer than 50 jobs because with only very few jobs the fail ratio might not be very meaningful. I suppose I should increase the 50 as the time frame since 2023-11-29 increases. So I guess that worker appearing is not a big deal after all.
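To make the effect of that threshold easier to play with outside the database, the aggregation can be reproduced in a few lines. This is a sketch under assumptions: the input is taken to be plain (host, result) pairs exported from the query, the name `fail_rate_by_host` is made up, and the sample data at the end is invented:

```python
from collections import defaultdict

FAILED_RESULTS = {"failed", "incomplete"}


def fail_rate_by_host(jobs, min_jobs=50):
    """Aggregate (host, result) pairs into (host, total, fail_rate_percent).

    Hosts with min_jobs or fewer jobs are skipped, mirroring
    `having count(jobs.id) > 50` in the SQL query. Rows are sorted by
    fail rate, highest first.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for host, result in jobs:
        totals[host] += 1
        if result in FAILED_RESULTS:
            failures[host] += 1
    rows = [
        (host, total, 100.0 * failures[host] / total)
        for host, total in totals.items()
        if total > min_jobs
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Invented sample: a host with few jobs drops out once the threshold is raised
sample = [("busy", "failed")] + [("busy", "passed")] * 3
sample += [("new", "failed"), ("new", "passed")]
print(fail_rate_by_host(sample, min_jobs=1))
print(fail_rate_by_host(sample, min_jobs=2))
```

Raising `min_jobs` as the time frame grows keeps hosts with only a handful of jobs from showing up at the top of the list.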
Updated by mkittler 9 months ago
This all means that the fail rate is still higher than expected. The last time I had a closer look at the failures (last Friday) it didn't seem to be due to a fundamental issue with the MM setup anymore, though. Likely the higher fail rate is mainly due to #151612. So maybe it makes sense to wait until this issue has been resolved.
Updated by okurz 9 months ago
- Related to action #151612: [kernel][tools] test fails in suseconnect_scc - SUT times out trying to reach https://scc.suse.com added
Updated by okurz 9 months ago
- Status changed from Feedback to In Progress
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1061 merged. If 1460 works then please also update https://github.com/os-autoinst/os-autoinst/pull/2409
But probably better try to bisect which is the biggest possible value to avoid both problems "A" (this ticket) as well as problem "B" #151612
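The bisection could look roughly like the sketch below. It assumes the behaviour is monotonic (if one MTU works, every smaller one down to the lower bound works as well), which is an assumption and not something established in this ticket. The predicate `works` is hypothetical: in practice it would mean scheduling enough test clusters with that bridge MTU to tell sporadic failures apart from real ones. The bounds and the threshold in the usage line are invented:

```python
def largest_working_mtu(works, low, high):
    """Return the largest value in [low, high] for which works(value) is true.

    Assumes that if a value works, all smaller values down to `low` work too.
    Returns None if not even `low` works.
    """
    if not works(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always progresses
        if works(mid):
            low = mid
        else:
            high = mid - 1
    return low


# Invented example: pretend everything up to 1472 works
print(largest_working_mtu(lambda mtu: mtu <= 1472, 1458, 1500))
```

Each call to `works` is expensive here, so the logarithmic number of steps is the main point of bisecting rather than trying every value.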
Updated by mkittler 9 months ago
Update for os-autoinst: https://github.com/os-autoinst/os-autoinst/pull/2413
I would not go for the biggest possible value to avoid problem "A" and would rather go for a round number in the middle to have wiggle room in both directions. If I went to an extreme it would actually be the lowest value we can set for the bridge (which would be 1458 for openSUSE tests unless we adjust them to make the MTU even lower) because an error in that direction will show immediately/consistently and not just sporadically and is thus potentially easier to debug.
Updated by mkittler 9 months ago
The graph looks good again for the last 24 hours (AC1): https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=24&from=now-24h&to=now
Updated by mkittler 9 months ago
- Status changed from In Progress to Resolved
The fail ratio has deteriorated again but it is still below 20 % (~ 15 %). So I'm nevertheless considering this resolved. I mentioned still failing tests in the chat but it isn't clear whether they failed due to problems with the MM setup or due to other network-related problems.
Many of the failing jobs from yesterday that I checked show symptoms that are unlikely to be caused by a generally broken MM setup. The list of problematic groups that have failing MM jobs since yesterday is quite small actually:
openqa=> select distinct count(jobs.id), array_agg(jobs.id), (select name from job_groups where id = group_id), (array_agg(test))[1] as example_test from jobs left join job_dependencies on (id = child_job_id or id = parent_job_id) where dependency = 2 and t_finished >= '2023-12-07T18:00' and result in ('failed') and test not like '%:investigate:%' group by group_id order by count(jobs.id) desc;
count | array_agg | name | example_test
-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------+---------------------------------------------------
20 | {12997767,12996153,12997053,12996840,12997128,12997120,12997481,12997483,12996349,12996756,12997269,12996156,12996442,12997064,12996149,12996549,12997685,12997051,12996560,12996685} | Security Maintenance Updates | fips_env_postgresql_ssl_server
18 | {12995494,12996941,12995690,12995340,12995691,12995605,12996946,12996952,12996948,12996933,12996931,12996928,12996937,12995339,12995501,12995502,12996950,12995341} | YaST Maintenance Updates - Development | mru-iscsi_client_normal_auth_backstore_fileio_dev
1 | {12994455} | JeOS: Development | jeos-nfs-client
1 | {12995646} | Wicked Maintenance Updates | qam_wicked_startandstop_sut
(4 rows)
Updated by okurz 9 months ago
- Related to action #152389: significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU" size:M added