action #151310 (closed)

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

[regression] significant increase of parallel_failed+failed since 2023-11-21 size:M

Added by okurz 6 months ago. Updated 4 months ago.

Status: Resolved
Priority: High
Assignee: mkittler
Category: Regressions/Crashes
Target version:
Start date: 2023-11-23
Due date:
% Done: 0%
Estimated time:

Description

Motivation

As visible on https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1700508932604&to=1700724085546&viewPanel=24
since about 2023-11-21 there has again been a significant increase of failed (and parallel_failed) multi-machine tests, which should be investigated, mitigated, fixed and prevented.

Acceptance criteria

Suggestions

  • Start looking into the issue early as waiting longer makes everything harder for us :)
  • Look up common failure sources and find out whether these are actually test or product regressions rather than infrastructure issues.
  • Ask common stakeholders and/or test reviewers if they know anything
  • Review recent infrastructure changes which might be related
  • Mitigate, fix and prevent the issues you find
  • Consider using the scientific method https://progress.opensuse.org/projects/openqav3/wiki/#Further-decision-steps-working-on-test-issues
  • Use SQL queries to find out which failures are most common (see the sketch right after this list)
    • Consider using this opportunity to document one or two examples of how we commonly do that
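
For example, a starting point could look like this (a sketch assuming direct psql access to the openqa database on OSD; the date, result filter and grouping are just examples, the actual queries used in the comments below follow the same pattern):

  # list the most common failing multi-machine scenarios since the regression started
  psql openqa -c "
    select count(distinct jobs.id) as fails,
           test,
           (select name from job_groups where id = group_id) as job_group
      from jobs
      left join job_dependencies on (jobs.id = child_job_id or jobs.id = parent_job_id)
     where dependency = 2   -- restrict to jobs with parallel (MM) dependencies
       and jobs.result in ('failed', 'parallel_failed', 'incomplete')
       and t_finished >= '2023-11-21'
       and test not like '%:investigate:%'
     group by test, group_id
     order by fails desc
     limit 20;"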

Related issues 2 (0 open, 2 closed)

Related to openQA Tests - action #151612: [kernel][tools] test fails in suseconnect_scc - SUT times out trying to reach https://scc.suse.com (Resolved, mkittler, 2023-11-28)

Related to openQA Project - action #152389: significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU" size:M (Resolved, mkittler, 2023-12-11)

Actions #1

Updated by livdywan 6 months ago

  • Subject changed from [regression] significant increase of parallel_failed+failed since 2023-11-21 to [regression] significant increase of parallel_failed+failed since 2023-11-21 size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #2

Updated by mkittler 6 months ago

  • Assignee set to mkittler
Actions #4

Updated by mkittler 6 months ago

I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/687 to disable MM tests on worker35 and worker36 as these are the most problematic (see query above). Also, after shutting down these machines yesterday "all jobs run without problem except one", which is definitely better than it was with these machines running MM tests.

Actions #5

Updated by okurz 6 months ago

Actions #6

Updated by mkittler 6 months ago

I also powered the machines back on, applied the salt states immediately, and checked that the tap worker class is not there.
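
For reference, a quick way to double-check that on the affected hosts is something along these lines from the salt master (a sketch; the hostname glob is an assumption):

  # confirm the tap class is gone from the worker configuration on w35/w36
  salt 'worker3[56]*' cmd.run 'grep WORKER_CLASS /etc/openqa/workers.ini'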

Actions #7

Updated by mkittler 5 months ago

For future reference: One of the problematic scenarios was https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-HA-Updates&machine=64bit&test=qam_alpha_cluster_01&version=15-SP2 - In the error case the hostname module already shows that there's no IP assigned and downloads fail later on (the hostname module itself doesn't fail; it would arguably be good if this test failed early).

Just for comparison:

  • In passing runs an IP is assigned on eth0 and the MTU is 1458.
  • In the failing runs no IP is assigned on eth0 and the MTU is 1500. The interface is shown as up at least.

Considering our discussion with Dirk it is good that the MTU is lowered, and maybe the fact that it is not lowered when the test is failing is the culprit. However, it could also just be a symptom. I'm also not sure where the MTU of 1458 shown in the good case comes from (it is supposedly either configured automatically or the test code somehow sets it).

Actions #8

Updated by mkittler 5 months ago · Edited

The MTU is actually lowered manually by tests via https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/data/wicked/dhcp/dhcpd.conf. The configure_static_ip helper for MM tests defined in https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/lib/mm_network.pm does this as well. The support server setup also takes care of lowering the MTU in https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/support_server/setup.pm.

Now the question is whether any of this code is executed in the problematic test scenarios. If not, I'm inclined to blame the tests.

EDIT: It looks like the support server in this scenario (qam_alpha_supportserver@64bit) would execute configure_static_ip and then the IP and MTU also show up correctly (also when run as part of the failing cluster), e.g. https://openqa.suse.de/tests/12876164#step/setup/12 and https://openqa.suse.de/tests/12876164#step/setup/26. The other nodes don't seem to make an effort to lower the MTU. Maybe that's the missing bit that would help to stabilize these tests? It would still be strange that it is sometimes not necessary.

(In the problematic cluster the other node that ran on worker29 had a lowered MTU and an IP address assigned despite no explicit setup. Strange that on some nodes/workers it just works nevertheless. Supposedly this node really only parallel_failed because the other node on worker35 failed. This leaves one really wondering what the difference between "good" workers like 29 and "bad" workers like 35 and 36 is.)
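
For illustration, what the helpers above effectively arrange on a node boils down to something like the following (a shell sketch, not the actual Perl helper code; the interface name and the value 1458 are taken from the observations above):

  # lower the MTU below the 1500 default to leave room for the encapsulation overhead between workers
  ip link set dev eth0 mtu 1458
  # verify that an address is assigned and that the new MTU took effect
  ip addr show dev eth0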

Actions #9

Updated by nicksinger 5 months ago

Based on https://docs.openvswitch.org/en/latest/faq/issues/ "Q: How can I configure the bridge internal interface MTU? Why does Open vSwitch keep changing internal ports MTU?" I just executed ovs-vsctl set int br1 mtu_request=1450 on workers 29, 35 and 39 where @mkittler now triggered a new test cluster. If it does not work it can be reverted with: ovs-vsctl set int br1 mtu_request=[]
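
To check whether the request actually took effect on a worker, something like this can be used (a sketch; br1 is the bridge name from the command above):

  # the configured request ([] when unset) and the effective MTU of the bridge interface
  ovs-vsctl get Interface br1 mtu_request
  ip link show br1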

Actions #10

Updated by mkittler 5 months ago · Edited

Here's the restarted job cluster to see whether @nicksinger's change made a difference (scheduled so that all jobs run on the same nodes as before, where previously the job on worker35 failed): https://openqa.suse.de/tests/12909195

2 more clusters to see whether results are consistent: https://openqa.suse.de/tests/12909617, https://openqa.suse.de/tests/12909619
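
For reference, re-triggering such a cluster can be done roughly like this (a sketch; openqa-clone-job clones the parallel partners along with the given job, and the BUILD override is an assumption matching the test overview links used later in this ticket — how the jobs were actually pinned to specific workers is not shown here):

  # clone one job of the cluster within OSD; its parallel partners are cloned as well,
  # and the BUILD override groups the clones on a dedicated test overview page
  openqa-clone-job --within-instance https://openqa.suse.de <job-id> BUILD=20231127-1-test-mm-1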

Actions #11

Updated by mkittler 5 months ago

  • Status changed from Workable to In Progress

All tests have passed now. Maybe it makes sense to apply this setting everywhere. We could then also try to re-enable w35 and w36.

It could also just be luck, though, especially because the MTU visible within the VMs does not match 1450 (it is 1458). So I restarted the three clusters again.

Actions #12

Updated by openqa_review 5 months ago

  • Due date set to 2023-12-12

Setting due date based on mean cycle time of SUSE QE Tools

Actions #13

Updated by okurz 5 months ago

Making the MTU setting persistent:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1055 is now merged. Now we should monitor the impact on tests.

@mkittler I suggest you schedule some more multi-machine test clusters.

Actions #14

Updated by mkittler 5 months ago

We enabled the mtu_request setting on all workers via salt. Let's see what impact this has. If everything looks good we can merge https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/689 and monitor the fail ratio of MM jobs by hosts again to see whether the fail ratio of w35 and w36 has dropped.
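
Roughly, rolling the setting out and verifying it across the fleet looks like this from the salt master (a sketch; targeting via the roles:worker grain is an assumption):

  # apply the merged salt states and confirm the bridge MTU request everywhere
  salt -C 'G@roles:worker' state.apply
  salt -C 'G@roles:worker' cmd.run 'ovs-vsctl get Interface br1 mtu_request'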

Actions #15

Updated by mkittler 5 months ago

I created 50 more runs (https://openqa.suse.de/tests/overview?version=15-SP2&distri=sle&build=20231127-1-test-mm-1) to run on any host, only realizing now that none of the jobs will run on w35 or w36 because we disabled tap there.

So I created 25 more runs to run across w35 and w36 (https://openqa.suse.de/tests/overview?build=20231127-1-test-mm-2&distri=sle&version=15-SP2). This way we can now at least compare the fail ratio between w35/w36 and all others.

Actions #16

Updated by mkittler 5 months ago

The pass-rate is a surprising 100 % in both groups (w35/w36, others). So I would say that w35/w36 look good enough to enable them again in production for MM jobs.

Actions #17

Updated by mkittler 5 months ago

  • Status changed from In Progress to Feedback

I re-enabled w35/w36, mentioned it in the chat and I'm going to check the fail ratio tomorrow (or this evening if there are already enough jobs).

Actions #18

Updated by mkittler 5 months ago

Here's the fail/incomplete ratio of jobs with parallel dependencies by worker as of yesterday:

openqa=> select distinct count(jobs.id) as total, sum(case when jobs.result in ('failed', 'incomplete') then 1 else 0 end) * 100. / count(jobs.id) as fail_rate_percent, host from jobs left join job_dependencies on (id = child_job_id or id = parent_job_id) join workers on jobs.assigned_worker_id = workers.id where dependency = 2 and t_started >= '2023-11-29' group by host having count(jobs.id) > 50 order by fail_rate_percent desc;
 total |  fail_rate_percent  |    host     
-------+---------------------+-------------
    51 | 37.2549019607843137 | worker-arm1
   468 | 26.7094017094017094 | worker38
   437 | 24.7139588100686499 | worker29
   552 | 24.4565217391304348 | worker40
   396 | 22.7272727272727273 | worker39
   401 | 22.6932668329177057 | worker37
   326 | 22.0858895705521472 | worker35
   385 | 21.2987012987012987 | worker30
   331 | 18.4290030211480363 | worker36
(9 rows)

So w35 and w36 are not on top (with a gap) anymore; that role now belongs to worker-arm1. That's good for w35/w36, so I'm keeping them enabled for MM jobs. Maybe changing the MTU setting really helped. I'll run this query again later/tomorrow.

Actions #19

Updated by mkittler 5 months ago

The fail ratio has now gone down a bit:

openqa=> select distinct count(jobs.id) as total, sum(case when jobs.result in ('failed', 'incomplete') then 1 else 0 end) * 100. / count(jobs.id) as fail_rate_percent, host from jobs left join job_dependencies on (id = child_job_id or id = parent_job_id) join workers on jobs.assigned_worker_id = workers.id where dependency = 2 and t_started >= '2023-11-29' group by host having count(jobs.id) > 50 order by fail_rate_percent desc;
 total |  fail_rate_percent  |    host     
-------+---------------------+-------------
   805 | 22.4844720496894410 | worker38
   138 | 22.4637681159420290 | worker-arm2
   737 | 19.9457259158751696 | worker30
   772 | 19.8186528497409326 | worker29
   691 | 19.3921852387843705 | worker39
   150 | 19.3333333333333333 | worker-arm1
   905 | 18.7845303867403315 | worker40
   644 | 18.0124223602484472 | worker37
   642 | 17.7570093457943925 | worker35
   608 | 15.7894736842105263 | worker36
(10 rows)

w35 and w36 are now even both at the bottom.

Actions #20

Updated by mkittler 5 months ago · Edited

Not much has changed except that openqaworker18.qa.suse.cz is now running MM jobs and failing a lot with them:

openqa=> select distinct count(jobs.id) as total, sum(case when jobs.result in ('failed', 'incomplete') then 1 else 0 end) * 100. / count(jobs.id) as fail_rate_percent, host from jobs left join job_dependencies on (id = child_job_id or id = parent_job_id) join workers on jobs.assigned_worker_id = workers.id where dependency = 2 and t_started >= '2023-11-29' group by host having count(jobs.id) > 50 order by fail_rate_percent desc;
 total |  fail_rate_percent  |      host      
-------+---------------------+----------------
    80 | 25.0000000000000000 | openqaworker18
  1162 | 19.4492254733218589 | worker38
   822 | 18.1265206812652068 | worker-arm1
   997 | 17.7532597793380140 | worker39
  1173 | 17.3060528559249787 | worker29
  1276 | 16.9278996865203762 | worker40
   742 | 16.4420485175202156 | worker-arm2
  1069 | 16.1833489242282507 | worker30
   254 | 16.1417322834645669 | mania
   909 | 15.8415841584158416 | worker37
   964 | 15.7676348547717842 | worker35
   936 | 15.0641025641025641 | worker36

EDIT: Looks like openqaworker18.qa.suse.cz has just executed enough jobs over the last days to be on the list at all. I'm filtering out all workers that have run fewer than 50 jobs on purpose because with only very few jobs the fail ratio might not be very meaningful. I suppose I should increase the 50 as the time frame since 2023-11-29 grows (or make the threshold relative, see the sketch below). So I guess that worker appearing is not a big deal after all.
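
One way to avoid hard-coding the 50-job threshold is to make it relative to the busiest worker, for example (a sketch based on the query above; the 10 % cut-off is arbitrary):

  psql openqa -c "
    with per_host as (
      select count(jobs.id) as total,
             sum(case when jobs.result in ('failed', 'incomplete') then 1 else 0 end) * 100. / count(jobs.id) as fail_rate_percent,
             host
        from jobs
        left join job_dependencies on (jobs.id = child_job_id or jobs.id = parent_job_id)
        join workers on jobs.assigned_worker_id = workers.id
       where dependency = 2 and t_started >= '2023-11-29'
       group by host)
    select * from per_host
     where total > 0.1 * (select max(total) from per_host)
     order by fail_rate_percent desc;"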

Actions #21

Updated by mkittler 5 months ago

This all means that the fail rate is still higher than expected. The last time I had a closer look at the failures (last Friday), they didn't seem to be due to a fundamental issue with the MM setup anymore, though. Likely the higher fail rate is mainly due to #151612. So maybe it makes sense to wait until that issue has been resolved.

Actions #22

Updated by okurz 5 months ago

  • Related to action #151612: [kernel][tools] test fails in suseconnect_scc - SUT times out trying to reach https://scc.suse.com added
Actions #23

Updated by okurz 5 months ago

  • Parent task set to #111929
Actions #24

Updated by okurz 5 months ago

  • Status changed from Feedback to In Progress

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1061 merged. If 1460 works then please also update https://github.com/os-autoinst/os-autoinst/pull/2409

But it is probably better to try to bisect the biggest possible value that avoids both problem "A" (this ticket) and problem "B" #151612.
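
One way to bisect this on a live cluster is to probe the path between two SUTs with non-fragmenting pings of increasing size (a sketch; the peer address is just an example, and 1422 bytes of ICMP payload plus 28 bytes of IP/ICMP headers corresponds to an MTU of 1450):

  # fails with "message too long" once the payload no longer fits the path MTU
  ping -M do -c 3 -s 1422 10.0.2.15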

Actions #25

Updated by mkittler 5 months ago

Update for os-autoinst: https://github.com/os-autoinst/os-autoinst/pull/2413

I would not go for the biggest possible value that avoids problem "A" but rather for a round number in the middle to have wiggle room in both directions. If I went to an extreme it would actually be the lowest value we can set for the bridge (which would be 1458 for openSUSE tests, unless we adjust them to lower the MTU even further), because an error in that direction shows up immediately and consistently rather than just sporadically and is thus potentially easier to debug.

Actions #26

Updated by mkittler 5 months ago

Actions #27

Updated by mkittler 5 months ago

  • Status changed from In Progress to Resolved

The fail ratio has deteriorated again but it is still below 20 % (~15 %). So I'm nevertheless considering this resolved. I mentioned the still-failing tests in the chat but it isn't clear whether they failed due to problems with the MM setup or due to other network-related problems.

Lots of the failing jobs from yesterday that I checked show symptoms that are unlikely to be caused by a generally broken MM setup. The list of job groups with failing MM jobs since yesterday is actually quite small:

openqa=> select distinct count(jobs.id), array_agg(jobs.id), (select name from job_groups where id = group_id), (array_agg(test))[1] as example_test from jobs left join job_dependencies on (id = child_job_id or id = parent_job_id) where dependency = 2 and t_finished >= '2023-12-07T18:00' and result in ('failed') and test not like '%:investigate:%' group by group_id order by count(jobs.id) desc;
 count |                                                                                       array_agg                                                                                       |                  name                  |                   example_test                    
-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------+---------------------------------------------------
    20 | {12997767,12996153,12997053,12996840,12997128,12997120,12997481,12997483,12996349,12996756,12997269,12996156,12996442,12997064,12996149,12996549,12997685,12997051,12996560,12996685} | Security Maintenance Updates           | fips_env_postgresql_ssl_server
    18 | {12995494,12996941,12995690,12995340,12995691,12995605,12996946,12996952,12996948,12996933,12996931,12996928,12996937,12995339,12995501,12995502,12996950,12995341}                   | YaST Maintenance Updates - Development | mru-iscsi_client_normal_auth_backstore_fileio_dev
     1 | {12994455}                                                                                                                                                                            | JeOS: Development                      | jeos-nfs-client
     1 | {12995646}                                                                                                                                                                            | Wicked Maintenance Updates             | qam_wicked_startandstop_sut
(4 rows)
Actions #28

Updated by okurz 5 months ago

  • Related to action #152389: significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU" size:M added
Actions #29

Updated by okurz 4 months ago

  • Due date deleted (2023-12-12)