action #178576

closed

coordination #161414: [epic] Improved salt based infrastructure management

Workers unresponsive in salt pipelines including openqa-piworker, sapworker1 and monitor size:S

Added by livdywan about 1 month ago. Updated 15 days ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Start date:
2025-03-07
Due date:
% Done:
0%

Estimated time:

Description

Observation

See https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3959809

openqa-piworker.qe.nue2.suse.org:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20250310104340880948
sapworker1.qe.nue2.suse.org:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20250310104340880948
monitor.qe.nue2.suse.org:
    Minion did not return. [No response]

Suggestions

  • This seems to be reproducible

Rollback actions

  • DONE Add back to salt and production: diesel, petrol, monitor, sapworker1, openqa-piworker:

    for i in diesel.qe.nue2.suse.org petrol.qe.nue2.suse.org monitor.qe.nue2.suse.org sapworker1.qe.nue2.suse.org openqa-piworker.qe.nue2.suse.org ; do sudo salt-key -y -a $i; done
  • DONE Remove the "Systemd services" silence from https://monitor.qa.suse.de/alerting/silences

Related issues (2 open, 1 closed)

  • Related to openQA Infrastructure (public) - action #180194: Workers diesel and petrol are down which blocks the salt deployment pipeline size:S (Resolved, nicksinger, 2025-04-08 - 2025-04-24)
  • Copied to openQA Infrastructure (public) - action #179302: Better monitoring for correct MTU size limits (New, 2025-03-07)
  • Copied to openQA Infrastructure (public) - action #180122: Run openqa-piworker as part of our infrastructure while still being CC compliant (New)
Actions #1

Updated by nicksinger about 1 month ago

  • Status changed from New to In Progress
  • Assignee set to nicksinger
Actions #2

Updated by nicksinger about 1 month ago

I already checked network performance between petrol<->OSD as part of https://progress.opensuse.org/issues/178567 but found no apparent problem. @jbaier_cz mentioned a recent "Network Maintenance" last Friday (https://suse.slack.com/archives/C02AET1AAAD/p1741268343150269) which sounds highly related but is not confirmed. To get further assistance I created https://sd.suse.com/servicedesk/customer/portal/1/SD-182364

Actions #4

Updated by openqa_review about 1 month ago

  • Due date set to 2025-03-25

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by nicksinger about 1 month ago · Edited

  • Status changed from In Progress to Feedback

I used sapworker1 as a reproducer for the issue at hand while having a terminal open with salt-minion -l debug (after stopping the service). Logs showed a zeromq connection to OSD on TCP/4505 while OSD connects back to the worker on TCP/4506. I used salt sapworker1.qe.nue2.suse.org saltutil.sync_grains on OSD to quickly trigger the issue. Salt looked fine, but all of a sudden it stopped working and refused to answer the "sync_grains" request from OSD.

Using wireshark and remote ssh tracing, I found a lot of TCP retransmissions on the prg2wg interface. A noticeable pattern was that the last packet before the retransmissions always showed the maximum MTU of 1420 (the default of a wg interface). This hinted at MTU problems. Using ping -I prg2wg -M do -s $((1420-28)) openqa.suse.de I was quickly able to reproduce the problem: packets go out but no answer comes back (not even an ICMP message signaling a need for fragmentation). By slowly lowering these 1420 bytes I found the maximum payload to be 1284 with ping/ICMP; adding back the 28 bytes of ICMP and IP headers gives a maximum path MTU of 1312 for the prg2wg interface.

I added my findings to the SD ticket because I still suspect the recent network changes to be the cause ("IPSec tunnels got adjusted", and each tunnel usually lowers MTUs). As a workaround or possible configuration fix, I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1404 which currently fails pipelines because of #178564

Actions #6

Updated by livdywan about 1 month ago

  • Subject changed from Workers unresponsive in salt pipelines including openqa-piworker, sapworker1 and monitor to Workers unresponsive in salt pipelines including openqa-piworker, sapworker1 and monitor size:S
Actions #7

Updated by nicksinger about 1 month ago

  • Priority changed from High to Low

I applied the fix manually on all NUE2 machines with a wg tunnel. With that https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3964772 passed and salt is able to properly deploy again. I'm keeping the MR (and ticket here) open until #178564 is fixed because I would like to see proper tests first.

Actions #8

Updated by nicksinger about 1 month ago

  • Status changed from Feedback to In Progress
  • Priority changed from Low to High

Merged but unfortunately running on too many workers and therefore breaking our pipeline: https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/4007282#L154 - trying to come up with a fix.

Actions #9

Updated by nicksinger about 1 month ago

  • Due date deleted (2025-03-25)
  • Status changed from In Progress to Resolved

Deployment of https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1404 failed due to missing needs_wireguard-grains on backup-vm.qe.nue2.suse.org and baremetal-support.qe.nue2.suse.org - I now set it to "False" on these hosts. With a subsequent deployment in a different MR, the change got applied to all machines:

openqa:~ # salt -C '*.nue2.suse.org and not G@needs_wireguard:False' cmd.run 'cat /etc/wireguard/prg2wg.conf | grep -i mtu'
monitor.qe.nue2.suse.org:
    MTU = 1292
petrol.qe.nue2.suse.org:
    MTU = 1292
diesel.qe.nue2.suse.org:
    MTU = 1292
openqa-piworker.qe.nue2.suse.org:
    MTU = 1292
sapworker1.qe.nue2.suse.org:
    MTU = 1292
Actions #10

Updated by okurz about 1 month ago · Edited

  • Status changed from Resolved to In Progress

As discussed with nicksinger reopening. There are multiple pipelines either stalling and running into 2h timeout or weird salt issues: https://gitlab.suse.de/openqa/salt-states-openqa/-/pipelines

I tried with sudo count-fail-ratio salt \* saltutil.sync_grains and found no failure:

count-fail-ratio: Run: 20. Fails: 0. Fail ratio 0±0%. No fails, computed failure probability < 15.00%

Probably we need to apply the complete high state.
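For readers unfamiliar with the helper: the core idea of count-fail-ratio can be sketched in a few lines of shell. This is a hypothetical minimal re-implementation; the real helper additionally reports runtime statistics and a failure-probability bound.

```shell
# Minimal sketch of count-fail-ratio: run the given command $runs times
# and count non-zero exit codes.
count_fail_ratio() {
    local runs=${runs:-20} fails=0 i
    for i in $(seq "$runs"); do
        "$@" >/dev/null 2>&1 || fails=$((fails + 1))
    done
    echo "## count-fail-ratio: Run: $runs. Fails: $fails. Fail ratio $((fails * 100 / runs))%"
}
# e.g.: runs=20 count_fail_ratio salt \* saltutil.sync_grains
```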

Actions #11

Updated by okurz about 1 month ago

  • Parent task set to #161414
Actions #12

Updated by okurz about 1 month ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1413 (merged). Next step: Contact IT regarding expected MTU size. I will report a separate ticket about better MTU based monitoring.

Actions #13

Updated by okurz about 1 month ago

  • Copied to action #179302: Better monitoring for correct MTU size limits added
Actions #14

Updated by okurz about 1 month ago

  • Description updated (diff)

I tried to manually ensure that the MTU of 1272 is deployed on all relevant machines and partially triggered reboots. Over salt the machines are reachable, but over ssh only with a delay of multiple minutes, possibly because the lower MTU is not effective yet. I also couldn't find a stable way to start/restart the wireguard tunnels and am getting the error "RTNETLINK answers: No such device". Could it be that https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1404/ removed some devices or something?
Anyway, to not constantly fail salt pipelines I have for now removed all NUE2 wg hosts from salt and added a silence. Mind the corresponding rollback actions, please.

Actions #15

Updated by okurz about 1 month ago

  • Description updated (diff)
Actions #16

Updated by openqa_review about 1 month ago

  • Due date set to 2025-04-04

Setting due date based on mean cycle time of SUSE QE Tools

Actions #17

Updated by okurz 30 days ago

  • Priority changed from High to Urgent

From mmoese in https://suse.slack.com/archives/C02CANHLANP/p1742810337597879 #eng-testing

@qa-tools I see a lot of jobs being abandoned by sapworker1 (like https://openqa.suse.de/tests/17128750 )

I am convinced this is related to this ticket, raising to "Urgent"

Actions #18

Updated by nicksinger 30 days ago

I changed the MTU for diesel, monitor, openqa-piworker, petrol and sapworker1 according to the output of echo /etc/sysconfig/network/ifcfg-`ip -j r s default | jq -r ".[0].dev"` and restarted the network service on them. I determined the value again by using ping -6 -I em1 -M do -s $((1360-8-40)) openqa.suse.de (on sapworker1), but this time on the "outer interface". Removing the "PreUp="- and "PostUp="-statements in /etc/wireguard/prg2wg.conf makes the tunnel restart much faster. I also removed the "MTU="-statement again, as PMTU discovery (or auto-calculation, not sure) should work properly again. After restarting the wg interface, this is indeed the case:

sapworker1:/etc/wireguard # ip link show dev prg2wg
154: prg2wg: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1280 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/none
sapworker1:/etc/wireguard # ping -6 -I prg2wg -M do -s $((1280-8-40)) openqa.suse.de
PING openqa.suse.de (2a07:de40:b203:12:0:ff:fe4f:7c2b) from 2a07:de40:b21b:3:10:144:169:4 prg2wg: 1232 data bytes
1240 bytes from openqa.suse.de (2a07:de40:b203:12:0:ff:fe4f:7c2b): icmp_seq=1 ttl=62 time=5.94 ms
1240 bytes from openqa.suse.de (2a07:de40:b203:12:0:ff:fe4f:7c2b): icmp_seq=2 ttl=62 time=5.85 ms
1240 bytes from openqa.suse.de (2a07:de40:b203:12:0:ff:fe4f:7c2b): icmp_seq=3 ttl=62 time=5.75 ms
^C
--- openqa.suse.de ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 5.746/5.846/5.940/0.079 ms
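The payload sizes used in the pings above follow directly from the header overheads; a quick sanity check of the arithmetic (the IPv6 header is 40 bytes, the ICMPv6 header 8 bytes):

```shell
# Header arithmetic behind "ping -6 ... -s $((1280-8-40))": the payload
# must fit the tunnel MTU minus the IPv6 and ICMPv6 headers; ping reports
# each reply as ICMPv6 header plus payload.
MTU=1280
IPV6_HDR=40
ICMP6_HDR=8
PAYLOAD=$((MTU - IPV6_HDR - ICMP6_HDR))  # the -s argument
REPLY=$((PAYLOAD + ICMP6_HDR))           # the size ping prints per reply
echo "payload=$PAYLOAD reply=$REPLY"     # prints payload=1232 reply=1240
```

These match the "1232 data bytes" and "1240 bytes from ..." lines in the output above.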

I've accepted the salt-key for monitor again on OSD and test with this machine now if everything is stable again.

Actions #19

Updated by nicksinger 30 days ago

Seems to work. Before moving on to adding this into our salt, I am now verifying with the following, running in tmux session 0 on OSD:

openqa:/srv/salt/wireguard # runs=100 count-fail-ratio salt 'monitor.qe.nue2.suse.org' state.apply

Actions #20

Updated by nicksinger 29 days ago

finished with:

## count-fail-ratio: Run: 100. Fails: 7. Fail ratio 7.00±5.00%
## mean runtime: 100813±26821.26 ms

The fails are all instances of #176949 so going forward with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1416 should be fine. I didn't add them to salt yet because the tunnel MTU in salt would break them anyway.

Actions #21

Updated by livdywan 28 days ago

See team chat

Reintroduce explicit MTU setting for wireguard tunnels: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1420 (merged)

Actions #22

Updated by okurz 28 days ago

Failures of the day:

Actions #23

Updated by nicksinger 27 days ago

  • Priority changed from Urgent to High

So for now most of the machines are online and in salt again, without wireguard enabled but with proper MTU settings. It turned out that 1280 is the minimum MTU required for IPv6 to work, which is the reason why https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1420 broke a lot overnight. I ran an experiment on sapworker1 with NFS disabled: with "openqa_share_nfs: False" in /etc/salt/grains we can enable sapworker1 in salt again. https://openqa.suse.de/tests/17138601 is the first test in this setup. https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/4044163#L115 is failing again due to XML errors but at least passed sapworker1 quite fast.
The same goes for diesel. I will apply the same on petrol tomorrow after reviewing some openQA jobs running on these machines.
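The grain override mentioned above can be applied idempotently on a worker. A sketch under the assumption that static grains live in /etc/salt/grains; the grain name is taken from the comment, the helper name is hypothetical:

```shell
# Hypothetical helper: set the static grain used above so salt can treat
# this worker as having no NFS share. Appends only if not already present.
ensure_nfs_grain() {
    local grains=${1:-/etc/salt/grains}
    grep -q '^openqa_share_nfs:' "$grains" 2>/dev/null \
        || echo 'openqa_share_nfs: False' >> "$grains"
}
# on the worker: ensure_nfs_grain
# then from the salt master, refresh grains so the change is picked up:
#   salt 'sapworker1.qe.nue2.suse.org' saltutil.refresh_grains
```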

Actions #24

Updated by okurz 22 days ago

  • Priority changed from High to Urgent
Actions #25

Updated by livdywan 21 days ago

openqa-piworker.qe.nue2.suse.org:
Minion did not return. [Not connected]

Briefly checked in the daily. @okurz will mitigate by dropping piworker from salt to alleviate the urgency

Actions #26

Updated by okurz 21 days ago

  • Assignee changed from nicksinger to okurz

I removed openqa-piworker from salt again as it was blocking https://gitlab.suse.de/openqa/osd-deployment/-/jobs/4080187 . Same for petrol for https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/4063615#L478.
The machines are still listed in "rollback actions", so this is good enough. I retriggered the failed osd-deployment and salt-states pipelines and will monitor them.

Actions #27

Updated by okurz 21 days ago

  • Assignee changed from okurz to nicksinger
  • Priority changed from Urgent to High
Actions #28

Updated by nicksinger 19 days ago

diesel was disabled too; I enabled it again 2 days ago and verified with count-fail-ratio (100% success). Doing the same for petrol now, currently at 100% success on 14 runs - looks promising.

Actions #29

Updated by nicksinger 19 days ago

openqa:~ # runs=40 count-fail-ratio salt --static 'petrol.qe.nue2.suse.org' state.apply
[…]
## count-fail-ratio: Run: 40. Fails: 0. Fail ratio 0±0%. No fails, computed failure probability < 7.50%

doing the same for openqa-piworker now.
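The "computed failure probability" bound in the helper output appears to follow the statistical "rule of three": with zero failures in n runs, the 95% upper confidence bound on the failure probability is approximately 3/n. A quick check of that assumption against the numbers in this ticket:

```shell
# Rule of three: 0 fails in n runs gives an approximate 95% upper bound
# of 3/n on the failure probability, i.e. 300/n as a percentage.
bound() { awk -v n="$1" 'BEGIN { printf "< %.2f%%\n", 300 / n }'; }
bound 40   # prints "< 7.50%"  (matches the 40-run output above)
bound 20   # prints "< 15.00%" (matches the earlier 20-run check in #note-10)
```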

Actions #30

Updated by nicksinger 16 days ago

  • Copied to action #180122: Run openqa-piworker as part of our infrastructure while still being CC compliant added
Actions #31

Updated by nicksinger 16 days ago · Edited

  • Status changed from In Progress to Resolved

So diesel, petrol, sapworker1 and monitor have all been back in salt for quite a few days already, without any wireguard configured. For piworker I created #180122, as agreed today in Jitsi, and kept it disabled in salt (see rollback steps in the other ticket). Also, no related silences were found on https://monitor.qa.suse.de/alerting/silences

Actions #32

Updated by okurz 16 days ago

  • Status changed from Resolved to In Progress

please look into the rollback actions. https://monitor.qa.suse.de/alerting/silences still mentions "Systemd services" and also it says that there are 2 "Alerts silenced" so I did not just unsilence that

Actions #33

Updated by livdywan 15 days ago

  • Description updated (diff)
  • Due date changed from 2025-04-04 to 2025-04-08
  • Status changed from In Progress to Feedback

okurz wrote in #note-32:

please look into the rollback actions. https://monitor.qa.suse.de/alerting/silences still mentions "Systemd services" and also it says that there are 2 "Alerts silenced" so I did not just unsilence that

Where did or do you see 2 alerts?

I removed the systemd silence which was also mentioned here.

Actions #34

Updated by okurz 15 days ago

livdywan wrote in #note-33:

okurz wrote in #note-32:

please look into the rollback actions. https://monitor.qa.suse.de/alerting/silences still mentions "Systemd services" and also it says that there are 2 "Alerts silenced" so I did not just unsilence that

Where did or do you see 2 alerts?

On that very page in the column "Alerts silenced"

Actions #35

Updated by livdywan 15 days ago

  • Status changed from Feedback to Resolved

On that very page in the column "Alerts silenced"

I can see two silences with active alerts for #167057 and #178492 respectively right now.

#167057#note-7 seems to explain why we have an active alert for a closed ticket. So neither is relevant for this ticket.

Actions #36

Updated by nicksinger 15 days ago

  • Related to action #180194: Workers diesel and petrol are down which blocks the salt deployment pipeline size:S added
Actions #37

Updated by okurz 15 days ago

  • Due date deleted (2025-04-08)