action #152389

closed

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU" size:M

Added by okurz about 1 year ago. Updated 11 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2023-12-11
Due date:
% Done: 0%
Estimated time:

Description

Observation

openQA test in scenario sle-15-SP5-Server-DVD-Updates-x86_64-qam_kernel_multipath@64bit fails in
multipath_iscsi

Test suite description

Testsuite maintained at https://gitlab.suse.de/qa-maintenance/qam-openqa-yml. Maintainer: jpupava on 15sp1 is problem missing python-xml package

Reproducible

Fails since (at least) Build 20231210-1 (current job)

Expected result

Last good: 20231208-1 (or more recent)

Acceptance criteria

Problem

Pinging (above certain packet sizes set via the -s parameter) and certain traffic (e.g. SSH) hangs when going over GRE tunnels (the MM test setup).
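
For reference, the failing check inside the SUT boils down to a ping with the "don't fragment" flag set, as used throughout the investigation below (10.0.2.1 being the support server address in these scenarios):

ping -M do -s 1350 -c 1 10.0.2.1   # fails across worker hosts, works when both jobs run on the same worker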

H1 -> E1-1 take a look into openQA investigate results -> O1-1-1 openqa-investigate in job $url proves no changes in product

  • H1 REJECTED The product has changed -> unlikely because it happened across all products at the same time

  • H2 Fails because of changes in test setup

    • H2.1 REJECTED Recent changes of the MTU size on the bridge on worker hosts made a difference -> E2.1-1 Revert changes from https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1061 manually -> O2.1-1-1 reverting manually on two worker hosts didn't make any difference -> E2.1-2 Explicitly set back MTU to default value with ovs-vsctl -> #152389#note-26
    • H2.2 ACCEPTED The network behaves differently -> considering the testing with tracepath in #152557 the network definitely behaves in an unexpected way when it comes to routing and it most likely was not always like that
    • H2.2.1 Issue #152557 means that there might now be additional hops that lower the physical MTU size leading to the problem we see -> E2.2.1-1 wait until the SD ticket has been resolved and see if it works after that as before -> The routing is to be expected as between PRG1/NUE2 and PRG2 there is an ipsec tunnel involved which most likely impacts the effective MTU limit in the observed ways, E2.2.1-2 check whether lowering the MTU would help (as that might indicate that the physical MTU size is indeed reduced) -> lowering the MTU size within the VM helps indeed, see #152389#note-25
    • H2.3 The automatic reboot of machines in our Sunday maintenance window had an impact -> E2.3-1 First check if workers actually rebooted -> O2.3-1-1 sudo salt -C 'G@roles:worker' cmd.run 'w' shows that all workers rebooted on last Sunday so there was a reboot -> E2.3-2 Implement #152095 for better investigation
    • H2.4 REJECTED Scenarios failing now were actually never tested as part of https://progress.opensuse.org/issues/151310 -> the scenario https://openqa.suse.de/tests/13018864 was passing at the time when #151310 was resolved and queries for failing MM jobs done in #151310 didn't show the many failures we see now
    • H2.5 ACCEPTED There is something wrong in the combination of GRE tunnels with more than 2 physical hosts (#152389-10) -> E2.5-1 Run a multi-machine cluster between two non-production hosts with only GRE tunnels between those two enabled -> with just 2 physical hosts the problem is indeed no longer reproducible, see #152389#note-29
    • H2.5.1 ACCEPTED Only a specific worker host (or the way it is connected) is problematic -> E2.5.1 Add further hosts step by step -> it seems that adding qesapworker-prg4.qa.suse.cz to the cluster causes the problem (but not similar workers like qesapworker-prg5.qa.suse.cz), see #152389#note-32 and #152389#note-35
  • H3 REJECTED Fails because of changes in test infrastructure software, e.g. os-autoinst, openQA
    -> O3-1-1 Comparing "first bad" https://openqa.suse.de/tests/13018864/logfile?filename=autoinst-log.txt os-autoinst version 4.6.1702036503.3b9f3a2 and "last good" 4.6.1701963272.58c0dd5 yielding

    $ git log1 --no-merges 58c0dd5..3b9f3a2
    fdf5f064 Improve `sudo`-usage in `t/20-openqa-isotovideo-utils.t`
    2f9d913a Consider code as generally uncoverable when testing relies on `sudo`
    

    also no relevant changes in openQA at all

  • H4 REJECTED Fails because of changes in test management configuration, e.g. openQA database settings -> O4-1-1 no relevant changes, see https://openqa.suse.de/tests/13018864#investigation

  • H5 REJECTED Fails because of changes in the test software itself (the test plan in source code as well as needles) -> no changes, see e.g. https://openqa.suse.de/tests/13018864#investigation

  • H6 REJECTED Sporadic issue, i.e. the root problem is already hidden in the system for a long time but does not show symptoms every time -> O6-1-1 https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=24&from=1701825350155&to=1702376526900 shows a clear regression by statistic

Suggestions

Debug in VMs (using the developer mode or by creating VMs manually) as we have already started in #152389#note-10 and subsequent comments.

The mentioned scenario is an easy reproducer but not the only affected scenario. Use e.g.

select distinct count(jobs.id), array_agg(jobs.id), (select name from job_groups where id = group_id), (array_agg(test))[1] as example_test from jobs left join job_dependencies on (id = child_job_id or id = parent_job_id) where dependency = 2 and t_finished >= '2023-12-05T18:00' and result in ('failed', 'incomplete') and test not like '%:investigate:%' group by group_id order by count(jobs.id) desc;

to find other scenarios that are possibly affected and relevant.
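
A sketch of how to run such queries, assuming local access to the openqa database on the OSD database host (the same openqa=> prompt as in the queries quoted in the comments below):

sudo -u postgres psql openqa   # then paste the query above at the openqa=> prompt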

  • Lower the MTU, but keep in mind the minimum for IPv6: according to dheidler, referencing Wikipedia, the minimum would be 1280 bytes
  • Use more hosts in different locations and see if those work more reliably in production
  • Ensure tap is not used by machines that don't work, since that would still affect the GRE setup
  1. Make the use of MTU in os-autoinst-distri-opensuse configurable, i.e. the value that the support_server sets as well as the ping_size_check
  2. DONE Use the minimum MTU in both os-autoinst-distri-opensuse as well as https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openvswitch.sls to be "on the safe side" for across-location tunneled GRE configurations -> https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18351 (merged)
  3. DONE Only enable "tap" in our workerconf.sls within one location per one architecture, e.g. only ppc64le tap in NUE2, only x86_64 tap in PRG2 -> https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/701 (merged) and https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/702 (merged)
  4. Change https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openvswitch.sls to only enable GRE tunnels on machines within the same datacenter, e.g. "location-prg" and rename accordingly in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls
  5. Follow #152737 regarding scheduling limited to individual zone

Rollback steps

  1. Revert https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/693 disabling all tap classes except one x86_64 worker host
  2. Revert https://gitlab.suse.de/openqa/salt-pillars-openqa/-/commit/4be80b2c720f6023b20355c9f4ac71096dc0aee4
  3. Remove silence from https://monitor.qa.suse.de/alerting/silences "alertname=Ratio of multi-machine tests by result alert"

Further details

Always latest result in this scenario: latest


Files

screenshot_20231211_183832.png (36.5 KB) - mkittler, 2023-12-11 17:40

Related issues 12 (0 open, 12 closed)

Related to openQA Project (public) - action #138698: significant increase in multi-machine test failures on OSD since 2023-10-25, e.g. test fails in support_server/setup size:M (Resolved, mkittler, 2023-10-27)

Related to openQA Project (public) - action #136154: multimachine tests restarted by RETRY test variable end up without the proper dependency size:M (Resolved, mkittler)

Related to openQA Project (public) - action #151310: [regression] significant increase of parallel_failed+failed since 2023-11-21 size:M (Resolved, mkittler, 2023-11-23)

Related to openQA Tests (public) - action #151612: [kernel][tools] test fails in suseconnect_scc - SUT times out trying to reach https://scc.suse.com (Resolved, mkittler, 2023-11-28)

Related to openQA Tests (public) - action #152461: [core][tools] test fails in various s390x-kvm tests with "s390x-kvm[\S\s]*(command 'zypper -n in[^\n]*timed out|sh install_k3s.sh[^\n]*failed)" (Resolved, okurz, 2023-12-12)

Related to openQA Infrastructure (public) - action #152095: [spike solution][timeboxed:8h] Ping over GRE tunnels and TAP devices and openvswitch outside a VM with differing packet sizes size:S (Resolved, jbaier_cz, 2023-12-05)

Related to openQA Tests (public) - action #152755: [tools] test fails in scc_registration - SCC not reachable despite not running multi-machine tests? size:M (Resolved, mkittler, 2023-12-19)

Related to openQA Project (public) - action #153769: Better handle changes in GRE tunnel configuration size:M (Resolved, okurz, 2024-01-17)

Related to openQA Project (public) - action #154552: [ppc64le] test fails in iscsi_client - zypper reports Error Message: Could not resolve host: openqa.suse.de (Resolved, mkittler, 2024-01-30)

Copied to openQA Infrastructure (public) - action #152557: unexpected routing between PRG1/NUE2+PRG2 (Resolved, okurz)

Copied to openQA Project (public) - action #153880: https://openqa.suse.de/tests/13277880#step/patterns/96 not being able to resolve download.suse.de, likely DNS problems in PRG1 (Resolved, okurz)

Copied to openQA Infrastructure (public) - action #160652: Secondary TAP worker class in different zones size:S (Resolved, ybonatakis)

Actions #1

Updated by okurz about 1 year ago

  • Related to action #138698: significant increase in multi-machine test failures on OSD since 2023-10-25, e.g. test fails in support_server/setup size:M added
Actions #2

Updated by okurz about 1 year ago

  • Subject changed from significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size to significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU":retry
Actions #3

Updated by okurz about 1 year ago

  • Related to coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers added
Actions #4

Updated by okurz about 1 year ago

  • Related to deleted (coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers)
Actions #5

Updated by okurz about 1 year ago

  • Parent task set to #111929
Actions #6

Updated by okurz about 1 year ago

  • Related to action #136154: multimachine tests restarted by RETRY test variable end up without the proper dependency size:M added
Actions #7

Updated by okurz about 1 year ago

  • Assignee set to mkittler
Actions #8

Updated by okurz about 1 year ago

  • Subject changed from significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU":retry to significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU"
Actions #9

Updated by livdywan about 1 year ago

scheme=http host=openqa.suse.de ./openqa-label-known-issues http://openqa.suse.de/tests/1301886
Actions #10

Updated by okurz about 1 year ago · Edited

openqa-clone-job --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de/tests/13018864 _GROUP=0 BUILD=poo152389 TEST+=-152389-okurz

=> sle-15-SP5-Server-DVD-Updates-x86_64-Build20231210-1-qam_kernel_multipath@64bit -> https://openqa.suse.de/tests/13037796

failed reproducibly in the same way.

negligible change in os-autoinst according to

$ git log1 --no-merges 58c0dd5..3b9f3a2
fdf5f064 Improve `sudo`-usage in `t/20-openqa-isotovideo-utils.t`
2f9d913a Consider code as generally uncoverable when testing relies on `sudo`

diffing between last good https://openqa.suse.de/tests/13010854 and first bad https://openqa.suse.de/tests/13018864 . https://openqa.suse.de/tests/13018864#investigation shows no significant change in test distribution, settings, needles, etc.

From https://openqa.suse.de/tests/13018864/file/multipath_iscsi-ip-addr-show.log I see mtu 1458, so … is that bad?

Using salt I looked up if the MTU size on openvswitch is still set:

sudo salt -C 'G@roles:worker' cmd.run 'sudo ovs-vsctl get int br1 mtu_request'
openqaworker18.qa.suse.cz:
    1460
…

same on all machines. So that setting is still active, seems persistent. I reran my investigation job https://openqa.suse.de/tests/13037796 and paused at the failing module, logged in interactively and could confirm that ping -M do -s 1350 -c 1 10.0.2.1 worked fine if both jobs of the cluster are running on the same machine (w30+w30) but failed when running on different machines like in https://openqa.suse.de/tests/13037991 on w36, support_server on w35

I ran tcpdump -l -i br1 icmp on both w35+w36 and could see ICMP echo requests going in and out up to a requested size of 1336 bytes but not above. Then only the echo request is written on the outgoing physical host w36 but not received on w35.

A ping between w35+w36 works up to ping -4 -M do -s 1458 worker36.oqa.prg2.suse.org where tcpdump shows length 1466. Then I ran sudo tcpdump -l -i any 'proto gre' | grep --line-buffered 'worker35.*ICMP echo' and could see

15:13:22.753275 eth0  Out IP worker36.oqa.prg2.suse.org > worker35.oqa.prg2.suse.org: GREv0, length 1386: IP 10.0.2.15 > 10.0.2.1: ICMP echo request, id 14, seq 277, length 1344
15:13:22.756001 eth0  Out IP worker36.oqa.prg2.suse.org > worker35.oqa.prg2.suse.org: GREv0, length 1386: IP 10.0.2.1 > 10.0.2.15: ICMP echo reply, id 14, seq 277, length 1344

So outgoing and incoming requests are tunneled over GRE for a requested size of 1336 bytes, which is 1344 bytes within GRE and 1386 bytes on the outer layer. We can also see the reply with the same parameters. For a requested size of 1337 bytes, where there is no response, I see

15:13:32.656890 eth0  Out IP worker36.oqa.prg2.suse.org > worker35.oqa.prg2.suse.org: GREv0, length 1387: IP 10.0.2.15 > 10.0.2.1: ICMP echo request, id 15, seq 1, length 1345

What I don't understand is why on w35 the request comes from qesapworker-prg4.qa.suse.cz instead of the expected w36 and also goes back there. sudo tcpdump -n -l -i any 'proto gre' | grep --line-buffered 'ICMP echo' actually shows that the same request goes to all GRE-connected devices

15:25:16.440214 eth0  In  IP 10.145.10.9 > 10.145.10.8: GREv0, length 1386: IP 10.0.2.15 > 10.0.2.1: ICMP echo request, id 20, seq 1, length 1344
15:25:16.441741 eth0  In  IP 10.100.101.74 > 10.145.10.8: GREv0, length 1386: IP 10.0.2.15 > 10.0.2.1: ICMP echo request, id 20, seq 1, length 1344
15:25:16.442250 eth0  Out IP 10.145.10.8 > 10.168.192.252: GREv0, length 1386: IP 10.0.2.15 > 10.0.2.1: ICMP echo request, id 20, seq 1, length 1344
15:25:16.442258 eth0  Out IP 10.145.10.8 > 10.168.192.108: GREv0, length 1386: IP 10.0.2.15 > 10.0.2.1: ICMP echo request, id 20, seq 1, length 1344
15:25:16.442264 eth0  Out IP 10.145.10.8 > 10.168.192.254: GREv0, length 1386: IP 10.0.2.15 > 10.0.2.1: ICMP echo request, id 20, seq 1, length 1344
15:25:16.442270 eth0  Out IP 10.145.10.8 > 10.100.101.76: GREv0, length 1386: IP 10.0.2.15 > 10.0.2.1: ICMP echo request, id 20, seq 1, length 1344
15:25:16.442275 eth0  Out IP 10.145.10.8 > 10.100.101.78: GREv0, length 1386: IP 10.0.2.15 > 10.0.2.1: ICMP echo request, id 20, seq 1, length 1344
15:25:16.442281 eth0  Out IP 10.145.10.8 > 10.100.101.80: GREv0, length 1386: IP 10.0.2.15 > 10.0.2.1: ICMP echo request, id 20, seq 1, length 1344
15:25:16.442287 eth0  Out IP 10.145.10.8 > 10.145.10.33: GREv0, length 1386: IP 10.0.2.15 > 10.0.2.1: ICMP echo request, id 20, seq 1, length 1344
15:25:16.442293 eth0  Out IP 10.145.10.8 > 10.145.10.34: GREv0, length 1386: IP 10.0.2.15 > 10.0.2.1: ICMP echo request, id 20, seq 1, length 1344
15:25:16.442298 eth0  Out IP 10.145.10.8 > 10.145.10.2: GREv0, length 1386: IP 10.0.2.15 > 10.0.2.1: ICMP echo request, id 20, seq 1, length 1344
15:25:16.442304 eth0  Out IP 10.145.10.8 > 10.145.10.3: GREv0, length 1386: IP 10.0.2.15 > 10.0.2.1: ICMP echo request, id 20, seq 1, length 1344
15:25:16.442309 eth0  Out IP 10.145.10.8 > 10.145.10.4: GREv0, length 1386: IP 10.0.2.15 > 10.0.2.1: ICMP echo request, id 20, seq 1, length 1344
15:25:16.442314 eth0  Out IP 10.145.10.8 > 10.145.10.5: GREv0, length 1386: IP 10.0.2.15 > 10.0.2.1: ICMP echo request, id 20, seq 1, length 1344
15:25:16.442319 eth0  Out IP 10.145.10.8 > 10.145.10.6: GREv0, length 1386: IP 10.0.2.15 > 10.0.2.1: ICMP echo request, id 20, seq 1, length 1344
15:25:16.442325 eth0  Out IP 10.145.10.8 > 10.145.10.10: GREv0, length 1386: IP 10.0.2.15 > 10.0.2.1: ICMP echo request, id 20, seq 1, length 1344
15:25:16.442331 eth0  Out IP 10.145.10.8 > 10.145.10.11: GREv0, length 1386: IP 10.0.2.15 > 10.0.2.1: ICMP echo request, id 20, seq 1, length 1344
15:25:16.442335 eth0  Out IP 10.145.10.8 > 10.145.10.12: GREv0, length 1386: IP 10.0.2.15 > 10.0.2.1: ICMP echo request, id 20, seq 1, length 1344
15:25:16.442341 eth0  Out IP 10.145.10.8 > 10.145.10.13: GREv0, length 1386: IP 10.0.2.15 > 10.0.2.1: ICMP echo request, id 20, seq 1, length 1344
15:25:16.442515 eth0  Out IP 10.145.10.8 > 10.100.101.74: GREv0, length 1386: IP 10.0.2.1 > 10.0.2.15: ICMP echo reply, id 20, seq 1, length 1344

So it seems the request is sent over all GRE connections while the response is only sent over one.

View from w36:

15:29:02.942767 eth0  Out IP 10.145.10.9 > 10.100.101.74: GREv0, length 1386: IP 10.0.2.15 > 10.0.2.1: ICMP echo request, id 23, seq 1, length 1344
15:29:02.946608 eth0  In  IP 10.100.101.74 > 10.145.10.9: GREv0, length 1386: IP 10.0.2.1 > 10.0.2.15: ICMP echo reply, id 23, seq 1, length 1344

with 10.100.101.74 being qesapworker-prg4.qa.suse.cz. That's confusing.
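
One way to check whether this flooding simply means the switch has not (yet) learned the destination MAC would be to look at the bridge's forwarding database; a sketch, assuming br1 as in the commands above:

sudo ovs-appctl fdb/show br1   # learned MAC-to-port mappings; unknown unicast destinations get flooded to all ports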

Actions #11

Updated by okurz about 1 year ago

  • Related to action #151310: [regression] significant increase of parallel_failed+failed since 2023-11-21 size:M added
Actions #12

Updated by okurz about 1 year ago

  • Related to action #151612: [kernel][tools] test fails in suseconnect_scc - SUT times out trying to reach https://scc.suse.com added
Actions #13

Updated by okurz about 1 year ago

  • Description updated (diff)
  • Status changed from New to In Progress

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/693 merged, added rollback step and retriggering jobs with

env host=openqa.suse.de result="result='parallel_failed'" failed_since="2023-12-10" comment="https://progress.opensuse.org/issues/152389" openqa-advanced-retrigger-jobs

retriggering about 1k jobs.

Actions #14

Updated by mkittler about 1 year ago · Edited

From https://openqa.suse.de/tests/13018864/file/multipath_iscsi-ip-addr-show.log I see mtu 1458, so … is that bad?

I would say it is not bad, as it is lower than or equal to what we have set on the bridge (1460).


We created https://openqa.suse.de/tests/13037796#step/multipath_iscsi/20 which ran on w40 (and support server on w39) where the ping test ping -M do -s 1350 -c 1 10.0.2.1; failed with:

PING 10.0.2.1 (10.0.2.1) 1350(1378) bytes of data.

--- 10.0.2.1 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

I set up two test VMs on w39 and w40 manually (according to https://github.com/os-autoinst/openQA/pull/5394) where I could also reproduce the issue in both directions.

It works just fine up to -s 1340 but as of -s 1341 it fails. It is notable that when dropping -M do it still fails; normally I'd have expected it to still work (using fragmentation).
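
To find the exact boundary without probing sizes by hand, a quick loop like the following can be used (a sketch; 10.0.2.1 being the ping target as in the test above):

# probe increasing ICMP payload sizes with DF set and report the first one that no longer gets through
for size in $(seq 1330 1460); do
    ping -M do -s "$size" -c 1 -W 2 10.0.2.1 >/dev/null 2>&1 || { echo "first failing size: $size"; break; }
done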

Note that HTTP traffic works as well but SSH already does not (see screenshot with output of ssh -v …; the final connection close there only happened after awaiting a timeout). I guess SSH not working as well is definitely showing the real problem here (as opposed to the rather artificial ping call).

The boundary of 1340 is actually different from the one @okurz mentioned before:

So outgoing and incoming requests are tunneled over GRE for a requested size of 1336 bytes, which is 1344 bytes within GRE and 1386 bytes on the outer layer. We can also see the reply with the same parameters. For a requested size of 1337 bytes, where there is no response, I see

I assume the difference between our experiments is that @okurz pinged a worker host from the VM and I pinged another VM from the VM. I couldn't reproduce the behavior exactly, though.

I tried to ping w40 from the VM on w39 via ping -M do -s 1430 -c1 10.145.10.13 and it still worked. Only ping -M do -s 1431 -c1 10.145.10.13 failed but with the very explicit error "ping: local error: message too long, mtu=1458". If I drop -M do to allow fragmentation then the ping works just fine as well (then also e.g. -s 10000 works just fine). It is exactly the same the other way around. Note that SSH traffic from a VM to just another worker host works as well.


I'm going to debug this further with tcpdump tomorrow. We're probably having the issue of fragments being dropped. The VM-to-VM traffic seems mainly affected (not the VM-to-whatever traffic).

Actions #15

Updated by mkittler about 1 year ago

SSH test (see previous comment): screenshot_20231211_183832.png

Actions #16

Updated by okurz about 1 year ago · Edited

  • Description updated (diff)

mkittler wrote in #note-14:

I assume the difference between our experiments is that @okurz pinged a worker host from the VM and I pinged another VM from the VM. I couldn't reproduce the behavior exactly, though.

No, it was actually VM to VM, so VM on worker36 over GRE tunnels to VM on worker35.

I found that my mitigation to disable the tap class was incomplete due to an outdated local git workspace, so I added
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/commit/4be80b2c720f6023b20355c9f4ac71096dc0aee4 to also disable tap on w35+w36 and retriggered tests accordingly.

https://openqa.suse.de/tests/13042194 from the original scenario has passed, so the mitigation is effective.

Actions #17

Updated by okurz about 1 year ago

  • Description updated (diff)
Actions #18

Updated by mkittler about 1 year ago

By the way, I also had a look at MAC addresses because I had to pick ones for my VMs that don't conflict with what we use for our normal VMs.

I exported relevant worker IDs via

\copy ( select distinct workers.id from workers join worker_properties on workers.id = worker_properties.worker_id where worker_properties.key = 'WORKER_CLASS' and worker_properties.value like '%tap%' ) to '/tmp/tap_workers' csv;

and computed the MAC addresses we'd assign to our VMs via

perl -e 'use Mojo::File qw(path); print(map { $workerid = $_; map { sprintf("52:54:00:12:%02x:%02x\n", int($workerid / 256) + $_ * 64, $workerid % 256) } (1..3) } split("\n", path("/hdd//tmp/tap_workers")->slurp))'

according to the code in os-autoinst/backend/qemu.pm to check whether we'd create invalid MAC addresses. At this point we don't create any (and there are also no duplicates). I guess we also have lots of room as the highest worker ID before we'd create invalid addresses would be 16320 (0x40 * 0xFF).
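
For illustration, the same formula rendered in shell for a single, hypothetical worker ID (the authoritative computation is the Perl one-liner above mirroring os-autoinst/backend/qemu.pm):

# hypothetical example: the MACs derived for worker ID 1234 and its first three NICs
workerid=1234
for nic in 1 2 3; do
    printf '52:54:00:12:%02x:%02x\n' $(( workerid / 256 + nic * 64 )) $(( workerid % 256 ))
done
# prints 52:54:00:12:44:d2, 52:54:00:12:84:d2 and 52:54:00:12:c4:d2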

Actions #19

Updated by mkittler about 1 year ago

No, it was actually VM to VM, so VM on worker36 over GRE tunnels to VM on worker35.

Strange, so depending on the particular workers we get different results (w35/36 vs. w39/40). But I guess it is the same problem nevertheless.

Actions #20

Updated by openqa_review about 1 year ago

  • Due date set to 2023-12-26

Setting due date based on mean cycle time of SUSE QE Tools

Actions #21

Updated by okurz about 1 year ago

  • Related to action #152461: [core][tools] test fails in various s390x-kvm tests with "s390x-kvm[\S\s]*(command 'zypper -n in[^\n]*timed out|sh install_k3s.sh[^\n]*failed)" added
Actions #22

Updated by okurz about 1 year ago

I think it's likely that #152461 could also be related to this.

Actions #23

Updated by mkittler about 1 year ago

  • Description updated (diff)
Actions #25

Updated by mkittler about 1 year ago · Edited

Setting the MTU on the bridge back to 1450 on both workers via sudo ovs-vsctl set int br1 mtu_request=1450 did not change anything, so the recent bump from 1450 to 1460 is not responsible. Reverting the MTU on the bridge back to the supposed default of 1500 changed nothing either.

Lowering the MTU to 1000 on both bridges and also to 1000 within the VMs actually helps. Likely one doesn't have to go that low; I still have to figure out the boundary. When keeping the MTU of the VMs at 1458 it did not help. However, it seems the MTU can be higher than 1000 within the VMs. This is now without -M do because otherwise we'd of course run into "message too long, mtu=1000" again; but before it also got stuck without -M do, so lowering the MTU on the bridge further definitely improves the situation. This way it is also possible to establish an SSH connection between the machines, which timed out before as well.


EDIT: It looks like lowering the MTU to 1367 in the VMs and keeping the bridges on 1460 is actually sufficient. This still means -M do does not work (1367 is bigger than the 1350 we use in the ping command, but with the overhead it apparently exceeds 1367). But SSH and probably other applications we actually care about do work. So at least as a mitigation we could:

  1. Lower the MTU within the SUTs from 1458 to 1367 (or maybe go even a little bit lower than the exact boundary), see the sketch after this list.
  2. Change the ping test from ping -M do -s 1350 … to ping -M do -s 1339 … (or again go a little bit lower than the exact boundary).
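
A minimal sketch of what point 1 means inside a SUT, assuming eth0 is the interface configured by lib/mm_network.pm:

ip link set dev eth0 mtu 1367                  # lower the interface MTU below the effective path MTU of the GRE route
ip link show dev eth0 | grep -o 'mtu [0-9]*'   # verify the new value
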
Actions #26

Updated by mkittler about 1 year ago

I now retried reverting our MTU changes again as in the previous comment but now via sudo ovs-vsctl set int br1 mtu_request=[] which will actually restore open vSwitch's default behavior (as documented on https://docs.openvswitch.org/en/latest/faq/issues). However, that doesn't seem to change any of the various cases and behaviors mentioned in my last comment.

Actions #27

Updated by okurz about 1 year ago

  • Related to action #152095: [spike solution][timeboxed:8h] Ping over GRE tunnels and TAP devices and openvswitch outside a VM with differing packet sizes size:S added
Actions #28

Updated by livdywan about 1 year ago

  • Description updated (diff)
Actions #29

Updated by mkittler about 1 year ago

When only connecting w39 and w40 with each other then ping -M do -s 1350 … works after setting the MTU on the worker hosts back to 1458. (With the current MTU limit of 1460 on the bridge and also using 1460 in the VM a ping with no fragmentation up to 1432 via ping -M do -s 1432 … is possible.)

Note that I did the following for just connecting those two hosts:

for i in {1..21}; do sudo ovs-vsctl del-port br1 gre$i ; done # on both hosts
sudo ovs-vsctl add-port br1 gre1 -- set interface gre1 type=gre options:remote_ip=10.145.10.12 # on w40
sudo ovs-vsctl add-port br1 gre1 -- set interface gre1 type=gre options:remote_ip=10.145.10.13 # on w39
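
To double-check which ports are left on the bridge after such changes, a quick look at the port list helps (sketch):

sudo ovs-vsctl list-ports br1   # should now show gre1 as the only GRE port on each of the two hosts
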
Actions #30

Updated by mkittler about 1 year ago

  • Description updated (diff)
Actions #31

Updated by mkittler about 1 year ago

  • Description updated (diff)
Actions #32

Updated by mkittler about 1 year ago · Edited

After #152389#note-29 it works and executing

sudo ovs-vsctl --may-exist add-port br1 gre5 -- set interface gre5 type=gre options:remote_ip=10.100.101.74 # qesapworker-prg4.qa.suse.cz

on w39/w40 breaks it again.

After

sudo ovs-vsctl del-port br1 gre5

and waiting shortly it works again.

When adding worker38.oqa.prg2.suse.org and worker-arm1.oqa.prg2.suse.org one after another it keeps working. I noticed that when adding another host the ping briefly gets stuck and "Destination Host Unreachable" may be repeated a few times. However, the situation then resolves itself. So I re-conducted the earlier test of adding qesapworker-prg4.qa.suse.cz again and waited 5 minutes. Even then the situation did not resolve itself. So adding this host is definitely problematic (and I was not just too impatient). It also keeps working after adding qesapworker-prg5.qa.suse.cz, so the qesapworker-prgX.qa.suse.cz workers are not generally problematic.

Actions #33

Updated by okurz about 1 year ago

  • Copied to action #152557: unexpected routing between PRG1/NUE2+PRG2 added
Actions #34

Updated by mkittler about 1 year ago

After scheduling MM jobs only on one worker the fail ratio declined a bit but not much: https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=24&from=1702227764694&to=1702464789789

So I executed host=openqa.suse.de result="result='parallel_failed'" failed_since="2023-12-12" comment="https://progress.opensuse.org/issues/152389" ./openqa-advanced-retrigger-jobs to restart MM failures that haven't been restarted yet.

Not sure whether we should limit the execution of MM jobs also for ARM workers. They currently definitely have quite a high fail ratio:

openqa=> select distinct count(jobs.id) as total, sum(case when jobs.result in ('failed', 'incomplete') then 1 else 0 end) * 100. / count(jobs.id) as fail_rate_percent, host from jobs left join job_dependencies on (id = child_job_id or id = parent_job_id) join workers on jobs.assigned_worker_id = workers.id where dependency = 2 and t_finished >= '2023-12-12' group by host having count(jobs.id) > 50 order by fail_rate_percent desc;
 total |  fail_rate_percent  |    host     
-------+---------------------+-------------
   144 | 29.1666666666666667 | worker-arm2
   138 | 20.2898550724637681 | worker-arm1
  2442 |  6.2653562653562654 | worker38
(3 rows)
Actions #35

Updated by mkittler about 1 year ago · Edited

  • Description updated (diff)

Setting the MTU on the bridge to something very low (e.g. sudo ovs-vsctl set int br1 mtu_request=1000) doesn't help and actually doesn't change anything.


I re-conducted my tests from #152389#note-32 but this time attempting an SSH connection instead of the rather artificial ping. The result is the same. So I updated H2.5.1 to be accepted.


Maybe the routing problem (#152557 / https://sd.suse.com/servicedesk/customer/portal/1/SD-142223) is also the culprit here. I've been updating the hypotheses accordingly.

So I guess I'm mainly waiting on #152557 then. I'll nevertheless keep the ticket in progress as I'm still monitoring the situation and maybe do a few experiments when I have an idea.

Actions #36

Updated by mkittler about 1 year ago

  • Description updated (diff)
Actions #37

Updated by livdywan about 1 year ago

  • Subject changed from significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU" to significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU" size:M
  • Description updated (diff)
Actions #38

Updated by mkittler about 1 year ago · Edited

MR for removing problematic hosts from the GRE network: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/700

I removed the problematic hosts from the GRE network non-persistently as mentioned in the MR and scheduled some test jobs:

for i in 29 31 33 35 37 39 ; do sudo openqa-clone-job --skip-download --skip-chained-deps --within-instance http://openqa.suse.de/tests/13066500 WORKER_CLASS:qam_kernel_multipath=qemu_x86_64,worker$i WORKER_CLASS:qam_kernel_multipath_supportserver=qemu_x86_64,worker$((i + 1)) {BUILD,TEST}+=-poo152389 _GROUP=0 ; done
Cloning parents of sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath@64bit
Cloning parents of sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath_supportserver@64bit
Cloning children of sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath_supportserver@64bit
2 jobs have been created:
 - sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath_supportserver@64bit -> http://openqa.suse.de/tests/13072417
 - sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath@64bit -> http://openqa.suse.de/tests/13072416
Cloning parents of sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath@64bit
Cloning parents of sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath_supportserver@64bit
Cloning children of sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath_supportserver@64bit
2 jobs have been created:
 - sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath_supportserver@64bit -> http://openqa.suse.de/tests/13072419
 - sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath@64bit -> http://openqa.suse.de/tests/13072418
Cloning parents of sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath@64bit
Cloning parents of sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath_supportserver@64bit
Cloning children of sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath_supportserver@64bit
2 jobs have been created:
 - sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath_supportserver@64bit -> http://openqa.suse.de/tests/13072420
 - sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath@64bit -> http://openqa.suse.de/tests/13072421
Cloning parents of sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath@64bit
Cloning parents of sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath_supportserver@64bit
Cloning children of sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath_supportserver@64bit
2 jobs have been created:
 - sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath_supportserver@64bit -> http://openqa.suse.de/tests/13072423
 - sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath@64bit -> http://openqa.suse.de/tests/13072422
Cloning parents of sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath@64bit
Cloning parents of sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath_supportserver@64bit
Cloning children of sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath_supportserver@64bit
2 jobs have been created:
 - sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath_supportserver@64bit -> http://openqa.suse.de/tests/13072424
 - sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath@64bit -> http://openqa.suse.de/tests/13072425
Cloning parents of sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath@64bit
Cloning parents of sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath_supportserver@64bit
Cloning children of sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath_supportserver@64bit
2 jobs have been created:
 - sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath_supportserver@64bit -> http://openqa.suse.de/tests/13072426
 - sle-15-SP5-Server-DVD-Updates-x86_64-Build20231213-1-qam_kernel_multipath@64bit -> http://openqa.suse.de/tests/13072427

EDIT: All tests have passed: https://openqa.suse.de/tests/13072427#next_previous
Within my VMs SSH is also still working, as is ping with even higher sizes (still keeping -M do). So from my side we could go on with merging the MR.

Actions #39

Updated by mkittler about 1 year ago

  • Priority changed from Urgent to High

I'm lowering the priority as it already looks better. The fail ratio on https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=24&from=now-7d&to=now is not yet below 20 %, but judging from the failures I briefly looked at there might also be problems that have nothing to do with the MTU/ping problem. The ping test issues and timeouts on TLS connections seem to be gone at least, e.g. the scenario mentioned in the ticket description passes now despite jobs being scheduled on different workers (https://openqa.suse.de/tests/13076014).

Actions #40

Updated by pcervinka about 1 year ago

Actions #41

Updated by mkittler about 1 year ago · Edited

I can also reproduce it in test VMs with my usual test setup. This time a size of 1350 actually works but 1400 is too much. SSH also doesn't work again.

I don't see the workers I previously found problematic and removed via https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/700 anymore (sudo salt -C 'G@roles:worker' cmd.run 'grep "10.100.101" /etc/wicked/scripts/gre_tunnel_preup.sh' shows no results). So that means it is not just that the removal of these workers was not persistent.

Actions #43

Updated by okurz about 1 year ago

Both merged. I executed now

env host=openqa.suse.de result="result='parallel_failed'" failed_since="2023-12-17" comment="label:https://progress.opensuse.org/issues/152389" openqa-advanced-retrigger-jobs
Actions #45

Updated by okurz about 1 year ago · Edited

Regarding the original problem it seems like the following: we have all openQA workers connected with point-to-point GRE tunnels, so all are inter-connected, also across locations. Using STP to prevent loops apparently causes inefficient routes, e.g. worker38 in PRG2 going over an FC Basement worker back to PRG2, which also means that we are limited in the maximum MTU we can support. So my suggestions are the following:

  1. DONE via https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18353 Make the use of MTU in os-autoinst-distri-opensuse configurable, i.e. the value that the support_server sets as well as the ping_size_check
  2. REJECTED Use the minimum MTU in both os-autoinst-distri-opensuse as well as https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openvswitch.sls to be "on the safe side" for across-location tunneled GRE configurations
    • Likely not very useful because that MTU might be too low for certain test scenarios (e.g. Wireguard tunnel). So I guess there's rather a "safe middle" (and not a "safe side").
  3. REJECTED Only enable "tap" in our workerconf.sls within one location per one architecture, e.g. only ppc64le tap in NUE2, only x86_64 tap in PRG2
    • We already have that, which I've just double-checked. However, that is not very helpful because, with our network topology relying on STP, the shortest route might not be taken, so traffic between ppc64le hosts might still go through an x86_64 host on a different site.
  4. Change https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openvswitch.sls to only enable GRE tunnels on machines within the same datacenter, e.g. "location-prg" and rename accordingly in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls
  5. Follow #152737 regarding scheduling limited to individual zone
Actions #46

Updated by dheidler about 1 year ago

I did some packet dumps using tcpdump and wireshark and checked the overhead of the tunnel.
We are adding 20 bytes of IP header, 4 bytes of GRE header and 14 bytes of inner Ethernet header,
so in total we have 38 bytes of overhead.
So if two workers are in the same network and use an MTU of 1500 on their Ethernet interfaces, with the path MTU being 1500 as well,
we are left with an MTU of 1462 according to my calculation. Currently we seem to set it to 1458 by default in lib/mm_network.pm.
This makes sense if we take the 4-byte VLAN tag into account in our calculation.
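
The same arithmetic spelled out, using the numbers from the text above:

# GRE-over-IPv4 overhead for the MM setup
outer_mtu=1500                            # Ethernet MTU between workers in the same network
echo $(( outer_mtu - 20 - 4 - 14 ))       # 1462: minus outer IP header, GRE header and inner Ethernet header
echo $(( outer_mtu - 20 - 4 - 14 - 4 ))   # 1458: additionally minus the 4-byte VLAN tag, the lib/mm_network.pm default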

Actions #47

Updated by okurz almost 1 year ago

  • Related to action #152755: [tools] test fails in scc_registration - SCC not reachable despite not running multi-machine tests? size:M added
Actions #48

Updated by dheidler almost 1 year ago

As I described in https://sd.suse.com/servicedesk/customer/portal/1/SD-142688 we seem to have a path MTU of 1422 between NUE2 and PRG2.
So the infra-maintained VPN tunnel between these locations produces 78 bytes of overhead.
Our own GRE tunnel produces 42 bytes of overhead if we want to use VLANs within our tunnel.
So if we used an MTU of 1380 in our SUT VMs, we shouldn't run into issues even with inter-location links.
(That is true at least as long as we use IPv4 to transport our GRE packets.)
Also 1380 is still enough to use IPv6 within our GRE tunnel (IPv6 requires at least 1280).
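
A hedged sketch of how to verify such numbers from a worker host, similar to the tracepath testing referenced for #152557 (the target host name is just an example; run it from a host in the other location):

tracepath -n worker35.oqa.prg2.suse.org                 # reports the discovered path MTU per hop
ping -4 -M do -s 1394 -c 1 worker35.oqa.prg2.suse.org   # 1394 bytes payload + 28 bytes ICMP/IP header = 1422 on the wire
# 1500 - 78 (location VPN overhead) = 1422 path MTU; 1422 - 42 (our GRE + VLAN overhead) = 1380 usable inside the SUTs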

Actions #49

Updated by mkittler almost 1 year ago

  • Status changed from In Progress to Feedback

Just for the record, the fail ratio is getting better (with the current mitigations in place): https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=24

I will not do any further changes before the Christmas break. If someone wants to try something, e.g. following up on the first points of #152389#note-45 (using the MTU size mentioned in #152389#note-48) feel free to take over.

Actions #51

Updated by okurz 12 months ago

  • Description updated (diff)
  • Due date changed from 2023-12-26 to 2024-01-19
  • Priority changed from High to Normal

I put the suggestions from #152389-45 into the description.

https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18351 merged.

From
https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-2d&to=now&viewPanel=24
we see a significant improvement with parallel_failed+failed below 10% so we can monitor over the next week with lower prio.

In the meantime the open suggestions can still be executed.

Actions #53

Updated by mkittler 11 months ago

The PR has been merged.

The test scenario mentioned specifically in the ticket description (https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-Updates&machine=64bit&test=qam_kernel_multipath&version=15-SP5) looks good. However, parallel_failed+failed on https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-2d&to=now&viewPanel=24 is only slightly below 20 % right now.

I suppose I'll have to investigate what the remaining issues are and we can decide in the next infra call on further steps.

Actions #54

Updated by mkittler 11 months ago

Looks like many jobs were just failing because the cache queue on the server was full. However, those jobs were also restarted/cloned and usually passed eventually. (Yes, those jobs end up as incomplete which we don't consider here. However, other jobs in the cluster might end up as parallel failed (depending on timing it can also be skipped), see e.g. https://openqa.suse.de/tests/13205056#dependencies.)

Some jobs were also just retried via RETRY=1 which in fact worked.

So I ran the query behind the graph manually filtering out jobs that have been cloned:

openqa=> with mm_jobs as (select distinct id, result from jobs left join job_dependencies on (id = child_job_id or id = parent_job_id) where t_created >= (select timezone('UTC', now()) - interval '24 hour') and result != 'none' and dependency = 2 and clone_id is null) select result, round(count(id) * 100. / (select count(id) from mm_jobs), 2)::numeric(5,2)::float as ratio_mm from mm_jobs group by mm_jobs.result;
      result      | ratio_mm 
------------------+----------
 user_cancelled   |     0.21
 softfailed       |    12.42
 passed           |    69.49
 skipped          |     3.53
 timeout_exceeded |     0.21
 failed           |      3.1
 incomplete       |     2.25
 parallel_failed  |     8.78

With this we're actually significantly below 20 % again.

Suggestion for changing the graph's query to exclude cloned jobs: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1075

Actions #55

Updated by mkittler 11 months ago

This way the list of failing tests since the last few days isn't that long anymore:

openqa=> select distinct count(jobs.id), array_agg(jobs.id), (select name from job_groups where id = group_id), (array_agg(test))[1] as example_test from jobs left join job_dependencies on (id = child_job_id or id = parent_job_id) and clone_id is null where dependency = 2 and t_finished >= '2024-01-05' and result in ('failed', 'incomplete') and test not like '%:investigate:%' group by group_id order by count(jobs.id) desc;
 count | array_agg | name | example_test
-------+-----------+------+--------------
    27 | {13207621,13212384,13212409,13212411,13212413,13212391,13212393,13212382,13207606,13207619,13202969,13202997,13202986,13202989,13207617,13207593,13207600,13212395,13207626,13202976,13202971,13202993,13207589,13212379,13207624,13202995,13202982} | YaST Maintenance Updates - Development | mru-iscsi_client_normal_auth_backstore_hdd_dev
     8 | {13207688,13207685,13203162,13198611,13203166,13198614,13212493,13212496} | Test Security | fips_tests_xrdp_remote-desktop-supportserver4
     6 | {13212364,13205300,13212387,13202921,13205457,13200618} | SAP/HA Maintenance Updates | qam_ha_rolling_upgrade_migration_node01
     5 | {13216762,13217139,13215349,13215148,13217143} | Maintenance - QR - SLE15SP5-SAP | sles4sap_hana_node01
     3 | {13200538,13205121,13209956} | JeOS: Development | jeos-nfs-client
     3 | {13215098,13214992,13215064} | Maintenance - QR - SLE15SP5-Security | fips_ker_stunnel_server
     1 | {13217096} |  | ha_hawk_haproxy_node01
     1 | {13217531} | HA Development | ha_zalpha_node2_publish
(8 rows)

I'll look into those tests tomorrow.

Actions #56

Updated by mkittler 11 months ago

The remaining failures are about missing assets and other non-networking-related issues. The only exception is the test module patch_and_reboot, e.g. https://openqa.suse.de/tests/13212384 (https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-Updates&machine=64bit&test=mru-iscsi_client_normal_auth_backstore_lvm_dev&version=12-SP5). But this module seems to have been broken for months.

Actions #57

Updated by okurz 11 months ago · Edited

https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=24&from=1704964971248&to=1704966378145 shows a good decrease of incomplete+failed+parallel_failed so we are at 1+7+19=27% so that part is covered.

@mkittler I suggest to look into the open points of #152389-45 again

Actions #58

Updated by mkittler 11 months ago · Edited

  • Status changed from Feedback to In Progress

I updated #152389#note-45. The only remaining points are 4 and 5 and 5 is already just a reference to a follow-up ticket. So I'll try to implement 4 if it isn't difficult (and otherwise create a separate ticket for it).

EDIT: Drafts for 4: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1079, https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/705

If this is wanted I can continue with it. (The Jinja code probably doesn't work that way and I'll need to refactor it. However, the current draft should give an idea of what it'll look like.)

Actions #59

Updated by mkittler 11 months ago

I guess https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1079 worked as intended but there are some remarks:

  • Now tap_poo… worker classes are no longer considered when creating the setup script for the GRE tunnel connections. So at this point /etc/wicked/scripts/gre_tunnel_preup.sh only exists on worker38 and arm1. The other aspects of the setup (e.g. creation of tap devices) are not affected, though. I think that's actually a good/acceptable change. Currently GRE tunnels are completely disabled anyway (see next point), so this change doesn't affect us at all right now.
  • Maybe we could remove the "# Disabling GRE tunnels due to https://progress.opensuse.org/issues/152389" workaround to try with the split and thus now smaller GRE networks. Due to the previous point we currently only had one GRE network between arm-1 and worker38 as both are in the same location and both are the only workers with the production tap worker class. It would of course not gain us much as our MM capacity wouldn't go up.
  • There's no GRE network/config for PowerPC workers at all because mania is the only PowerPC worker with production tap class and also the only tap worker in general on its site. So also in this edge case the configuration (or rather absence of it) looks good.
Actions #60

Updated by mkittler 11 months ago

  • Status changed from In Progress to Feedback

MRs for rollback steps:

I removed the silence as the alerts would not fire anymore right now anyway and it would be good to be alerted in case merging the above changes caused problems.

@livdywan mentioned it would make sense to split these roll back steps into a separate ticket which makes sense considering how long this ticket has already been in progress. We can discuss that in tomorrow's infra daily.

Actions #61

Updated by okurz 11 months ago

mkittler wrote in #note-60:

@livdywan mentioned it would make sense to split these roll back steps into a separate ticket which makes sense considering how long this ticket has already been in progress. We can discuss that in tomorrow's infra daily.

Well, considering that you already created the merge requests to revert and as you already removed the silence, I suggest giving it a go and merging both MRs at your convenience but closely monitoring today+tomorrow. If there is any problem then revert again and plan the rollbacks in separate ticket(s).

Actions #62

Updated by okurz 11 months ago

  • Status changed from Feedback to In Progress
  • Priority changed from Normal to Urgent

https://openqa.suse.de/tests/13263418#step/iscsi_client/32 looks problematic. I suggest reverting your latest changes and please fix the corresponding test failures.

Actions #63

Updated by okurz 11 months ago · Edited

mkittler triggered a reboot for all OSD workers and now I did

env host=openqa.suse.de result="result='parallel_failed'" failed_since="2024-01-16 12:00Z" comment="label:poo#152389" ./openqa-advanced-retrigger-jobs

which retriggered about 300 jobs

EDIT: Jobs like https://openqa.suse.de/tests/13263920 look ok again as well as https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-3h&to=now&viewPanel=24
@mkittler as you already assumed that a reboot would fix the issue, maybe we can come up with a better approach to ensure the corresponding network parts are actively reloaded/restarted to prevent a similar issue in the future?

Actions #64

Updated by mkittler 11 months ago · Edited

  • Priority changed from Urgent to Normal

Yes, the jobs look good again - despite being scheduled across multiple workers. So I'm lowering the priority again.

maybe we can come up with a better approach to ensure the corresponding network parts are actively reloaded/restarted to prevent a similar issue in the future?

It would have already helped to re-run the setup script, e.g. sudo salt -C 'G@roles:worker' cmd.run '/etc/wicked/scripts/gre_tunnel_preup.sh update br1'. We could of course automate this by adding a corresponding salt state that runs whenever that script changes. The script already deletes all existing ports before adding new ones so this shouldn't be problematic. It may still disrupt tests, though. At least when I did experiments like #152389#note-32 I always had to wait briefly until everything worked again after deleting/adding ports. (That's probably because STP needs to do its job, which doesn't happen instantly.)

Actions #65

Updated by okurz 11 months ago

cat /etc/wicked/scripts/gre_tunnel_preup.sh update br1? are you missing a pipe?

Actions #66

Updated by mkittler 11 months ago

No, but I had a cat too much :-)
(I edited the comment. I only copied that from the history where I used cat to check the file contents.)

Actions #67

Updated by mkittler 11 months ago

  • Status changed from In Progress to Resolved

I followed all rollback steps and the fail ratio looks acceptable. I created #153769 as a follow-up for the problem mentioned in previous comments.

Actions #68

Updated by mkittler 11 months ago

  • Related to action #153769: Better handle changes in GRE tunnel configuration size:M added
Actions #69

Updated by okurz 11 months ago

  • Copied to action #153880: https://openqa.suse.de/tests/13277880#step/patterns/96 not being able to resolve download.suse.de, likely DNS problems in PRG1 added
Actions #70

Updated by okurz 11 months ago

  • Due date deleted (2024-01-19)
Actions #71

Updated by okurz 11 months ago

  • Related to action #154552: [ppc64le] test fails in iscsi_client - zypper reports Error Message: Could not resolve host: openqa.suse.de added
Actions #72

Updated by okurz 7 months ago

  • Copied to action #160652: Secondary TAP worker class in different zones size:S added