action #178576

coordination #161414: [epic] Improved salt based infrastructure management

Workers unresponsive in salt pipelines including openqa-piworker, sapworker1 and monitor size:S

Added by livdywan 21 days ago. Updated 4 days ago.

Status: In Progress
Priority: High
Assignee: nicksinger
Category: Regressions/Crashes
Start date: 2025-03-07
Due date: 2025-04-04 (Due in 4 days)
% Done: 0%
Estimated time:

Description

Observation

See https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3959809

openqa-piworker.qe.nue2.suse.org:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20250310104340880948
sapworker1.qe.nue2.suse.org:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20250310104340880948
monitor.qe.nue2.suse.org:
    Minion did not return. [No response]

Suggestions

  • This seems to be reproducible

Rollback actions

  • Add diesel, petrol, monitor, sapworker1 and openqa-piworker back to salt and production (a verification sketch follows after this list):
    for i in diesel.qe.nue2.suse.org petrol.qe.nue2.suse.org monitor.qe.nue2.suse.org sapworker1.qe.nue2.suse.org openqa-piworker.qe.nue2.suse.org ; do sudo salt-key -y -a $i; done
  • Remove the "Systemd services" silence from https://monitor.qa.suse.de/alerting/silences
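
After re-adding the keys, a quick check like the following can confirm the minions are back and responding (a sketch, not part of the original rollback steps):

    # list accepted keys, then ping the re-added minions
    sudo salt-key -L
    sudo salt -L 'diesel.qe.nue2.suse.org,petrol.qe.nue2.suse.org,monitor.qe.nue2.suse.org,sapworker1.qe.nue2.suse.org,openqa-piworker.qe.nue2.suse.org' test.ping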

Related issues (1 open, 0 closed)

Copied to openQA Infrastructure (public) - action #179302: Better monitoring for correct MTU size limits (New, 2025-03-07)

Actions #1

Updated by nicksinger 21 days ago

  • Status changed from New to In Progress
  • Assignee set to nicksinger
Actions #2

Updated by nicksinger 21 days ago

I already checked network performance between petrol and OSD as part of https://progress.opensuse.org/issues/178567 but found no apparent problem. @jbaier_cz mentioned a recent "Network Maintenance" last Friday (https://suse.slack.com/archives/C02AET1AAAD/p1741268343150269) which sounds highly related but is not confirmed. To get further assistance I created https://sd.suse.com/servicedesk/customer/portal/1/SD-182364.
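
A sketch of the kind of throughput check that could be used here (hypothetical; the ticket does not state which tool was actually used):

    # on openqa.suse.de (OSD): start an iperf3 server
    iperf3 -s
    # on petrol.qe.nue2.suse.org: run a 30 second throughput test against OSD
    iperf3 -c openqa.suse.de -t 30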

Actions #4

Updated by openqa_review 20 days ago

  • Due date set to 2025-03-25

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by nicksinger 20 days ago · Edited

  • Status changed from In Progress to Feedback

I used sapworker1 as a reproducer for the issue at hand while keeping a terminal open with salt-minion -l debug (after stopping the salt-minion service).
The logs showed a zeromq connection to OSD on TCP/4505 while OSD connects back to the worker on TCP/4506. I used salt sapworker1.qe.nue2.suse.org saltutil.sync_grains on OSD to quickly trigger the issue: salt looked fine, but all of a sudden it stopped working and refused to answer the sync_grains request from OSD. Using wireshark and remote ssh tracing I found a lot of TCP retransmissions on the prg2wg interface. A noticeable pattern was that the last packet before the retransmissions always had the maximum MTU of 1420 (the default of a wireguard interface), which hinted at MTU problems. Using ping -I prg2wg -M do -s $((1420-28)) openqa.suse.de I was quickly able to reproduce the problem: packets go out but no answer comes back, not even an ICMP message signaling the need for fragmentation. Slowly lowering these 1420 bytes I found the maximum working ICMP payload to be 1284; adding the 28 bytes of IP and ICMP headers gives an effective path MTU of 1312 for the prg2wg interface. I added my findings to the SD ticket because I still suspect the recent network changes to be the cause ("IPSec tunnels got adjusted", and each tunnel usually lowers the MTU). As a workaround or possible configuration fix I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1404, which currently fails pipelines because of #178564.
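
For reference, a minimal sketch of the probing described above, lowering the payload of ping -M do (don't fragment) until replies come back; the step size and bounds are arbitrary choices, and the 28 bytes account for the IPv4 and ICMP headers:

    # probe the usable path MTU over the wireguard tunnel
    for size in $(seq 1420 -4 1200); do
        if ping -c 1 -W 2 -I prg2wg -M do -s $((size - 28)) openqa.suse.de >/dev/null 2>&1; then
            echo "largest working packet size: $size"
            break
        fi
    done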

Actions #6

Updated by livdywan 20 days ago

  • Subject changed from Workers unresponsive in salt pipelines including openqa-piworker, sapworker1 and monitor to Workers unresponsive in salt pipelines including openqa-piworker, sapworker1 and monitor size:S
Actions #7

Updated by nicksinger 19 days ago

  • Priority changed from High to Low

I applied the fix manually on all NUE2 machines with a wg tunnel. With that https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3964772 passed and salt is able to properly deploy again. I'm keeping the MR (and ticket here) open until #178564 is fixed because I would like to see proper tests first.
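
For reference, a minimal sketch of what such a manual, transient fix could look like (hypothetical; the ticket does not show the exact commands, and the MTU value is taken from the later deployment output):

    # lower the tunnel MTU immediately on the affected NUE2 workers (not persistent across tunnel restarts)
    sudo salt -L 'diesel.qe.nue2.suse.org,petrol.qe.nue2.suse.org,monitor.qe.nue2.suse.org,sapworker1.qe.nue2.suse.org,openqa-piworker.qe.nue2.suse.org' cmd.run 'ip link set dev prg2wg mtu 1292'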

Actions #8

Updated by nicksinger 11 days ago

  • Status changed from Feedback to In Progress
  • Priority changed from Low to High

Merged, but unfortunately it runs on too many workers and therefore breaks our pipeline: https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/4007282#L154. I am trying to come up with a fix.

Actions #9

Updated by nicksinger 11 days ago

  • Due date deleted (2025-03-25)
  • Status changed from In Progress to Resolved

Deployment of https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1404 failed due to the missing needs_wireguard grain on backup-vm.qe.nue2.suse.org and baremetal-support.qe.nue2.suse.org; I have now set it to "False" on these hosts. With a subsequent deployment in a different MR the change got applied to all machines:

openqa:~ # salt -C '*.nue2.suse.org and not G@needs_wireguard:False' cmd.run 'cat /etc/wireguard/prg2wg.conf | grep -i mtu'
monitor.qe.nue2.suse.org:
    MTU = 1292
petrol.qe.nue2.suse.org:
    MTU = 1292
diesel.qe.nue2.suse.org:
    MTU = 1292
openqa-piworker.qe.nue2.suse.org:
    MTU = 1292
sapworker1.qe.nue2.suse.org:
    MTU = 1292
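
A minimal sketch of how the grain could be set on the two hosts that were missing it (hypothetical; the ticket does not show the exact command used):

    # persist the grain on the affected minions (written to /etc/salt/grains)
    sudo salt -L 'backup-vm.qe.nue2.suse.org,baremetal-support.qe.nue2.suse.org' grains.setval needs_wireguard False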
Actions #10

Updated by okurz 10 days ago · Edited

  • Status changed from Resolved to In Progress

As discussed with nicksinger, reopening. There are multiple pipelines either stalling and running into the 2h timeout or showing weird salt issues: https://gitlab.suse.de/openqa/salt-states-openqa/-/pipelines

I tried sudo count-fail-ratio salt \* saltutil.sync_grains and found no failure:

    count-fail-ratio: Run: 20. Fails: 0. Fail ratio 0±0%. No fails, computed failure probability < 15.00%

Probably we need to apply the complete high state.
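
A sketch of how the complete high state could be exercised the same way (an assumed invocation, mirroring the count-fail-ratio call used later in this ticket):

    # as root on OSD: apply the full high state repeatedly and measure the failure ratio
    runs=20 count-fail-ratio salt \* state.apply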

Actions #11

Updated by okurz 10 days ago

  • Parent task set to #161414
Actions #12

Updated by okurz 10 days ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1413 (merged). Next step: contact IT regarding the expected MTU size. I will file a separate ticket about better MTU-based monitoring.

Actions #13

Updated by okurz 10 days ago

  • Copied to action #179302: Better monitoring for correct MTU size limits added
Actions #14

Updated by okurz 10 days ago

  • Description updated (diff)

I tried to manually ensure that the MTU of 1272 is deployed on all relevant machines and partially triggered reboots. Over salt the machines are reachable, but over ssh only with a delay of multiple minutes, possibly because the lower MTU is not effective yet. I also could not find a stable way to start/restart the wireguard tunnels and am getting the error "RTNETLINK answers: No such device". Could it be that https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1404/ removed some devices or something?
Anyway, to not constantly fail salt pipelines I have for now removed all NUE2 wireguard hosts from salt and added a silence. Please mind the corresponding rollback actions in the description.
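
A minimal sketch of that removal (hypothetical; it is simply the inverse of the rollback action in the description):

    # drop the NUE2 wireguard hosts from salt so the pipelines stop failing on them
    for i in diesel.qe.nue2.suse.org petrol.qe.nue2.suse.org monitor.qe.nue2.suse.org sapworker1.qe.nue2.suse.org openqa-piworker.qe.nue2.suse.org ; do sudo salt-key -y -d $i; done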

Actions #15

Updated by okurz 10 days ago

  • Description updated (diff)
Actions #16

Updated by openqa_review 10 days ago

  • Due date set to 2025-04-04

Setting due date based on mean cycle time of SUSE QE Tools

Actions #17

Updated by okurz 7 days ago

  • Priority changed from High to Urgent

From mmoese in https://suse.slack.com/archives/C02CANHLANP/p1742810337597879 #eng-testing

@qa-tools I see a lot of jobs being abandoned by sapworker1 (like https://openqa.suse.de/tests/17128750 )

I am convinced this is related to this ticket, raising to "Urgent"

Actions #18

Updated by nicksinger 6 days ago

I changed the MTU for diesel, monitor, openqa-piworker, petrol and sapworker1 in the ifcfg file of the default-route interface (found via echo /etc/sysconfig/network/ifcfg-`ip -j r s default | jq -r ".[0].dev"`) and restarted the network service on them. I determined the value again by using ping -6 -I em1 -M do -s $((1360-8-40)) openqa.suse.de (on sapworker1), but using the "outer" interface this time. Removing the "PreUp=" and "PostUp=" statements in /etc/wireguard/prg2wg.conf makes the tunnel restart much faster. I also removed the "MTU=" statement again because the PMTU discovery (or auto-calculation, not sure) should work properly again. After restarting the wg interface this is indeed the case:

sapworker1:/etc/wireguard # ip link show dev prg2wg
154: prg2wg: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1280 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/none
sapworker1:/etc/wireguard # ping -6 -I prg2wg -M do -s $((1280-8-40)) openqa.suse.de
PING openqa.suse.de (2a07:de40:b203:12:0:ff:fe4f:7c2b) from 2a07:de40:b21b:3:10:144:169:4 prg2wg: 1232 data bytes
1240 bytes from openqa.suse.de (2a07:de40:b203:12:0:ff:fe4f:7c2b): icmp_seq=1 ttl=62 time=5.94 ms
1240 bytes from openqa.suse.de (2a07:de40:b203:12:0:ff:fe4f:7c2b): icmp_seq=2 ttl=62 time=5.85 ms
1240 bytes from openqa.suse.de (2a07:de40:b203:12:0:ff:fe4f:7c2b): icmp_seq=3 ttl=62 time=5.75 ms
^C
--- openqa.suse.de ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 5.746/5.846/5.940/0.079 ms

I've accepted the salt key for monitor again on OSD and am now testing with this machine whether everything is stable again.
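
As a note on the payload arithmetic in the ping commands above (an explanation added here, not part of the original comment): with -M do the -s value is the ICMP payload, so for IPv6 the 40 byte IPv6 header and the 8 byte ICMPv6 header are subtracted from the interface MTU:

    # tunnel MTU minus IPv6 header (40) and ICMPv6 header (8)
    echo $((1280 - 40 - 8))   # prints 1232, matching the "1232 data bytes" in the ping output above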

Actions #19

Updated by nicksinger 6 days ago

Seems to work. Before moving on to adding this to our salt states, I am now verifying with runs=100 count-fail-ratio salt 'monitor.qe.nue2.suse.org' state.apply, running in tmux session 0 on OSD (in openqa:/srv/salt/wireguard).

Actions #20

Updated by nicksinger 6 days ago

Finished with:

## count-fail-ratio: Run: 100. Fails: 7. Fail ratio 7.00±5.00%
## mean runtime: 100813±26821.26 ms

The fails are all instances of #176949, so going forward with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1416 should be fine. I didn't add the machines to salt yet because the tunnel MTU currently configured in salt would break them anyway.

Actions #21

Updated by livdywan 5 days ago

See team chat

Reintroduce explicit MTU setting for wireguard tunnels: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1420 (merged)

Actions #22

Updated by okurz 5 days ago

Failures of the day:

Actions #23

Updated by nicksinger 4 days ago

  • Priority changed from Urgent to High

So for now most of the machines are online and in salt again, without wireguard enabled but with proper MTU settings. It turned out that 1280 is the minimum MTU required for IPv6 to work, which is the reason why https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1420 broke a lot overnight. I ran an experiment on sapworker1 with NFS disabled: with "openqa_share_nfs: False" in /etc/salt/grains we can enable sapworker1 in salt again. https://openqa.suse.de/tests/17138601 is the first test in this setup. https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/4044163#L115 is failing again due to XML errors but at least passed sapworker1 quite fast.
The same applies to diesel. I will apply the same on petrol tomorrow after reviewing some openQA jobs running on these machines.
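
A minimal sketch of the NFS-disabling grain setup used for this experiment (hypothetical commands; the grain name is taken from the comment above):

    # on the worker: mark it as not using the NFS share, then restart the minion so the grain is picked up
    echo 'openqa_share_nfs: False' | sudo tee -a /etc/salt/grains
    sudo systemctl restart salt-minion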
