action #178576
coordination #161414: [epic] Improved salt based infrastructure management
Workers unresponsive in salt pipelines including openqa-piworker, sapworker1 and monitor size:S
Description
Observation
See https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3959809
openqa-piworker.qe.nue2.suse.org:
Minion did not return. [No response]
The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
salt-run jobs.lookup_jid 20250310104340880948
sapworker1.qe.nue2.suse.org:
Minion did not return. [No response]
The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
salt-run jobs.lookup_jid 20250310104340880948
monitor.qe.nue2.suse.org:
Minion did not return. [No response]
Suggestions
- This seems to be reproducible
Rollback actions
- Add back to salt and production: diesel,petrol,monitor,sapworker1,openqa-piworker
for i in diesel.qe.nue2.suse.org petrol.qe.nue2.suse.org monitor.qe.nue2.suse.org sapworker1.qe.nue2.suse.org openqa-piworker.qe.nue2.suse.org ; do sudo salt-key -y -a $i; done
- Remove the "Systemd services" silence from https://monitor.qa.suse.de/alerting/silences
Updated by nicksinger 21 days ago
- Status changed from New to In Progress
- Assignee set to nicksinger
Updated by nicksinger 21 days ago
I already checked the network performance between petrol<->OSD as part of https://progress.opensuse.org/issues/178567 but found no apparent problem. @jbaier_cz mentioned a recent "Network Maintenance" last Friday (https://suse.slack.com/archives/C02AET1AAAD/p1741268343150269) which sounds highly related but is not confirmed yet. To get further assistance I created https://sd.suse.com/servicedesk/customer/portal/1/SD-182364.
Updated by openqa_review 20 days ago
- Due date set to 2025-03-25
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger 20 days ago · Edited
- Status changed from In Progress to Feedback
I used sapworker1 as a reproducer for the issue at hand while having a terminal open with salt-minion -l debug (the service was stopped beforehand). The logs showed a zeromq connection to OSD on TCP/4505 while OSD connects back to the worker on TCP/4506. I used salt sapworker1.qe.nue2.suse.org saltutil.sync_grains on OSD to quickly trigger the issue. Salt looked fine but all of a sudden it stopped working and refused to answer the "sync_grains" request from OSD. Using wireshark and remote-ssh-tracing I found a lot of TCP retransmissions on the prg2wg interface. A noticeable pattern was that the last packet before the retransmissions always had the maximum MTU of 1420 (the default of a wg interface), which hinted at MTU problems. Using ping -I prg2wg -M do -s $((1420-28)) openqa.suse.de I was quickly able to reproduce the problem: packets go out but no answer comes back (not even an ICMP message signaling the need for fragmentation). Slowly lowering the size from 1420 bytes I found the maximum working ICMP payload to be 1284; adding the 28 bytes of ICMP/IP header overhead that ping's -s does not account for means a usable MTU of 1312 for the prg2wg interface. I added my findings to the SD ticket because I still suspect the recent network changes to be the cause of this ("IPSec tunnels got adjusted" -> each tunnel usually lowers MTUs). As a workaround or possible configuration fix I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1404, which currently fails pipelines because of #178564.
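A minimal sketch of the probing described above, stepping the payload down until replies come back (interface name, step size and range are illustrative, not the exact procedure used):
# probe the usable path MTU over the wg tunnel; the -s payload plus 28 bytes of
# ICMP/IPv4 header is the packet size that has to fit through the path
for size in $(seq 1420 -4 1200); do
    if ping -c 1 -W 2 -I prg2wg -M do -s $((size - 28)) openqa.suse.de > /dev/null 2>&1; then
        echo "largest working packet size: $size bytes"
        break
    fi
done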
Updated by nicksinger 19 days ago
- Priority changed from High to Low
I applied the fix manually on all NUE2 machines with a wg tunnel. With that https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3964772 passed and salt is able to properly deploy again. I'm keeping the MR (and ticket here) open until #178564 is fixed because I would like to see proper tests first.
Updated by nicksinger 11 days ago
- Status changed from Feedback to In Progress
- Priority changed from Low to High
Merged, but unfortunately the change runs on too many workers and therefore breaks our pipeline: https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/4007282#L154 - I am trying to come up with a fix.
Updated by nicksinger 11 days ago
- Due date deleted (was 2025-03-25)
- Status changed from In Progress to Resolved
Deployment of https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1404 failed due to the missing needs_wireguard grain on backup-vm.qe.nue2.suse.org and baremetal-support.qe.nue2.suse.org - I have now set it to "False" on these hosts. With a subsequent deployment in a different MR, the change got applied to all machines:
openqa:~ # salt -C '*.nue2.suse.org and not G@needs_wireguard:False' cmd.run 'cat /etc/wireguard/prg2wg.conf | grep -i mtu'
monitor.qe.nue2.suse.org:
MTU = 1292
petrol.qe.nue2.suse.org:
MTU = 1292
diesel.qe.nue2.suse.org:
MTU = 1292
openqa-piworker.qe.nue2.suse.org:
MTU = 1292
sapworker1.qe.nue2.suse.org:
MTU = 1292
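For reference, a sketch of how the missing grain could be set from the master, assuming salt's grains.setval execution module (which persists the value in /etc/salt/grains on the minion); not necessarily the exact method used here:
# set needs_wireguard=False on the two hosts that were missing the grain and
# refresh so that grain-based targeting picks up the change
for m in backup-vm.qe.nue2.suse.org baremetal-support.qe.nue2.suse.org; do
    sudo salt "$m" grains.setval needs_wireguard False
    sudo salt "$m" saltutil.refresh_grains
done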
Updated by okurz 10 days ago · Edited
- Status changed from Resolved to In Progress
As discussed with nicksinger, reopening. There are multiple pipelines either stalling and running into the 2h timeout or showing weird salt issues: https://gitlab.suse.de/openqa/salt-states-openqa/-/pipelines
I tried with sudo count-fail-ratio salt \* saltutil.sync_grains and found no failure: count-fail-ratio: Run: 20. Fails: 0. Fail ratio 0±0%. No fails, computed failure probability < 15.00%. Probably we need to apply the complete high state.
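The count-fail-ratio helper essentially repeats a command and reports the failure rate; a simplified stand-in (not the actual implementation) looks roughly like this:
# run the salt call N times and count how often it fails
runs=20 fails=0
for i in $(seq "$runs"); do
    sudo salt '*' saltutil.sync_grains > /dev/null || fails=$((fails + 1))
done
echo "Run: $runs. Fails: $fails. Fail ratio: $((100 * fails / runs))%"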
Updated by okurz 10 days ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1413 (merged). Next step: Contact IT regarding expected MTU size. I will report a separate ticket about better MTU based monitoring.
Updated by okurz 10 days ago
- Copied to action #179302: Better monitoring for correct MTU size limits added
Updated by okurz 10 days ago
- Description updated (diff)
I tried to manually ensure that the MTU of 1272 is deployed on all relevant machines and partially triggered reboots. Over salt the machines are reachable, but over ssh only with a delay of multiple minutes, possibly because the lower MTU is not effective yet. I also could not find a stable way to start/restart the wireguard tunnels and am getting the error "RTNETLINK answers: No such device". Could it be that https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1404/ removed some devices or something?
Anyway, to not constantly fail salt pipelines I have for now removed all NUE2 wg hosts from salt and added a silence. Please mind the corresponding rollback actions.
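For the record, removing the hosts mirrors the rollback loop in the description, just with salt-key -d instead of -a (sketch):
# remove the NUE2 wireguard hosts from salt (inverse of the rollback action above)
for i in diesel.qe.nue2.suse.org petrol.qe.nue2.suse.org monitor.qe.nue2.suse.org sapworker1.qe.nue2.suse.org openqa-piworker.qe.nue2.suse.org ; do sudo salt-key -y -d $i; done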
Updated by openqa_review 10 days ago
- Due date set to 2025-04-04
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 7 days ago
- Priority changed from High to Urgent
From mmoese in https://suse.slack.com/archives/C02CANHLANP/p1742810337597879 #eng-testing
@qa-tools I see a lot of jobs being abandoned by sapworker1 (like https://openqa.suse.de/tests/17128750 )
I am convinced this is related to this ticket, hence raising to "Urgent".
Updated by nicksinger 6 days ago
I changed the MTU for diesel, monitor, openqa-piworker, petrol and sapworker1 in the config file given by echo /etc/sysconfig/network/ifcfg-`ip -j r s default | jq -r ".[0].dev"` and restarted the network service on them. I determined the value again by using ping -6 -I em1 -M do -s $((1360-8-40)) openqa.suse.de (on sapworker1), but using the "outer interface" this time. Removing the "PreUp="- and "PostUp="-statements in /etc/wireguard/prg2wg.conf makes the tunnel restart much faster. I also removed the "MTU="-statement again as the PMTU discovery (or auto-calculation, I am not sure which) should work properly again. After restarting the wg interface, this is indeed the case:
sapworker1:/etc/wireguard # ip link show dev prg2wg
154: prg2wg: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1280 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/none
sapworker1:/etc/wireguard # ping -6 -I prg2wg -M do -s $((1280-8-40)) openqa.suse.de
PING openqa.suse.de (2a07:de40:b203:12:0:ff:fe4f:7c2b) from 2a07:de40:b21b:3:10:144:169:4 prg2wg: 1232 data bytes
1240 bytes from openqa.suse.de (2a07:de40:b203:12:0:ff:fe4f:7c2b): icmp_seq=1 ttl=62 time=5.94 ms
1240 bytes from openqa.suse.de (2a07:de40:b203:12:0:ff:fe4f:7c2b): icmp_seq=2 ttl=62 time=5.85 ms
1240 bytes from openqa.suse.de (2a07:de40:b203:12:0:ff:fe4f:7c2b): icmp_seq=3 ttl=62 time=5.75 ms
^C
--- openqa.suse.de ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 5.746/5.846/5.940/0.079 ms
I've accepted the salt-key for monitor again on OSD and am now testing with this machine whether everything is stable again.
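Both changes described above could be scripted roughly as follows; this is a sketch only: the MTU value of 1360 is derived from the ping probe in this comment and the wg-quick unit name is an assumption, so the exact steps taken may have differed:
# pin the MTU of the outer interface via its ifcfg file and re-apply the network config
dev=$(ip -j r s default | jq -r '.[0].dev')
echo "MTU='1360'" | sudo tee -a /etc/sysconfig/network/ifcfg-$dev
sudo systemctl restart network
# drop the MTU/PreUp/PostUp overrides from the tunnel config so wg-quick
# auto-calculates the MTU, then restart the tunnel and check the result
sudo cp /etc/wireguard/prg2wg.conf /etc/wireguard/prg2wg.conf.bak
sudo sed -i -E '/^(MTU|PreUp|PostUp)[[:space:]]*=/d' /etc/wireguard/prg2wg.conf
sudo systemctl restart wg-quick@prg2wg
ip link show dev prg2wg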
Updated by nicksinger 6 days ago
Seems to work. Before moving on to adding this to our salt, I am now verifying with runs=100 count-fail-ratio salt 'monitor.qe.nue2.suse.org' state.apply (from openqa:/srv/salt/wireguard), running in tmux session 0 on OSD.
Updated by nicksinger 6 days ago
finished with:
## count-fail-ratio: Run: 100. Fails: 7. Fail ratio 7.00±5.00%
## mean runtime: 100813±26821.26 ms
The fails are all instances of #176949 so going forward with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1416 should be fine. I didn't add them to salt yet because the tunnel MTU in salt would break them anyway.
Updated by okurz 5 days ago
Failures of the day:
- https://gitlab.suse.de/openqa/osd-deployment/-/jobs/4041080#L6111 showing sapworker1 stuck with "System management is locked by the application with pid 42111 (zypper)". As I can't even log in normally over ssh (it takes long and only "reacts" to ctrl-c, showing a bash prompt without a proper PS1; ssh -4 also takes long but eventually reacts) I assume this is related to this ticket
- https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/4039441 running into the 2h timeout
Updated by nicksinger 4 days ago
- Priority changed from Urgent to High
So for now most of the machines are online and in salt again, without wireguard enabled but with proper MTU settings. It turned out that 1280 is the minimum MTU required for IPv6 to work, which is why https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1420 broke a lot overnight. I ran an experiment on sapworker1 with NFS disabled: with "openqa_share_nfs: False" in /etc/salt/grains we can enable sapworker1 in salt again. https://openqa.suse.de/tests/17138601 is the first test in this setup. https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/4044163#L115 is failing again due to XML errors but at least passed sapworker1 quite fast.
The same applies to diesel. I will apply the same on petrol tomorrow after reviewing some openQA jobs running on these machines.
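A sketch of the per-host grain change mentioned above (the grain name is taken from this comment; the exact mechanism used may differ):
# disable the NFS share handling for this worker via a local grain and make
# sure the minion picks it up
echo 'openqa_share_nfs: False' | sudo tee -a /etc/salt/grains
sudo salt-call saltutil.refresh_grains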