action #178576
coordination #161414: [epic] Improved salt based infrastructure management
Workers unresponsive in salt pipelines including openqa-piworker, sapworker1 and monitor size:S
Description
Observation
See https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3959809
openqa-piworker.qe.nue2.suse.org:
Minion did not return. [No response]
The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
salt-run jobs.lookup_jid 20250310104340880948
sapworker1.qe.nue2.suse.org:
Minion did not return. [No response]
The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
salt-run jobs.lookup_jid 20250310104340880948
monitor.qe.nue2.suse.org:
Minion did not return. [No response]
Suggestions
- This seems to be reproducible
Rollback actions
- Add back to salt and production: diesel,petrol,monitor,sapworker1,openqa-piworker
for i in diesel.qe.nue2.suse.org petrol.qe.nue2.suse.org monitor.qe.nue2.suse.org sapworker1.qe.nue2.suse.org openqa-piworker.qe.nue2.suse.org ; do sudo salt-key -y -a $i; done
- Remove the "Systemd services" silence from https://monitor.qa.suse.de/alerting/silences
Updated by nicksinger 21 days ago
- Status changed from New to In Progress
- Assignee set to nicksinger
Updated by nicksinger 21 days ago
I already checked the network performance between petrol<->OSD as part of https://progress.opensuse.org/issues/178567 but found no apparent problem. @jbaier_cz mentioned a recent "Network Maintenance" last Friday (https://suse.slack.com/archives/C02AET1AAAD/p1741268343150269) which sounds highly related but is not confirmed yet. To get further assistance I created https://sd.suse.com/servicedesk/customer/portal/1/SD-182364.
Updated by openqa_review 20 days ago
- Due date set to 2025-03-25
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger 20 days ago · Edited
- Status changed from In Progress to Feedback
I used sapworker1 as a reproducer for the issue at hand while having a terminal open with salt-minion -l debug (the service was stopped beforehand). The logs showed a zeromq connection to OSD on TCP/4505 while OSD connects back to the worker on TCP/4506. I used salt sapworker1.qe.nue2.suse.org saltutil.sync_grains on OSD to quickly trigger the issue. Salt looked fine but all of a sudden it stopped working and refused to answer the "sync_grains" request from OSD. Using wireshark and remote-ssh-tracing I found a lot of TCP retransmissions on the prg2wg interface. A noticeable pattern was that the last packet before the retransmissions always had the maximum MTU of 1420 (the default of a wg interface), which hinted at MTU problems. Using ping -I prg2wg -M do -s $((1420-28)) openqa.suse.de I was quickly able to reproduce the problem: packets go out but no answer comes back (not even an ICMP message signaling the need for fragmentation). Slowly lowering the size from 1420 bytes I found the maximum working ICMP payload to be 1284; adding the 28 bytes of ICMP/IP header overhead that ping's -s does not account for means a usable MTU of 1312 for the prg2wg interface. I added my findings to the SD ticket because I still suspect the recent network changes to be the cause of this ("IPSec tunnels got adjusted" -> each tunnel usually lowers MTUs). As a workaround or possible configuration fix I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1404, which currently fails pipelines because of #178564.
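A minimal sketch of the probing described above, stepping the payload down until replies come back (interface name, step size and range are illustrative, not the exact procedure used):
# probe the usable path MTU over the wg tunnel; the -s payload plus 28 bytes of
# ICMP/IPv4 header is the packet size that has to fit through the path
for size in $(seq 1420 -4 1200); do
    if ping -c 1 -W 2 -I prg2wg -M do -s $((size - 28)) openqa.suse.de > /dev/null 2>&1; then
        echo "largest working packet size: $size bytes"
        break
    fi
done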
Updated by nicksinger 19 days ago
- Priority changed from High to Low
I applied the fix manually on all NUE2 machines with a wg tunnel. With that https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3964772 passed and salt is able to properly deploy again. I'm keeping the MR (and ticket here) open until #178564 is fixed because I would like to see proper tests first.
Updated by nicksinger 11 days ago
- Status changed from Feedback to In Progress
- Priority changed from Low to High
Merged, but unfortunately the change runs on too many workers and therefore breaks our pipeline: https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/4007282#L154 - I am trying to come up with a fix.
Updated by nicksinger 11 days ago
- Due date deleted (was 2025-03-25)
- Status changed from In Progress to Resolved
Deployment of https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1404 failed due to the missing needs_wireguard grain on backup-vm.qe.nue2.suse.org and baremetal-support.qe.nue2.suse.org - I have now set it to "False" on these hosts. With a subsequent deployment in a different MR, the change got applied to all machines:
openqa:~ # salt -C '*.nue2.suse.org and not G@needs_wireguard:False' cmd.run 'cat /etc/wireguard/prg2wg.conf | grep -i mtu'
monitor.qe.nue2.suse.org:
MTU = 1292
petrol.qe.nue2.suse.org:
MTU = 1292
diesel.qe.nue2.suse.org:
MTU = 1292
openqa-piworker.qe.nue2.suse.org:
MTU = 1292
sapworker1.qe.nue2.suse.org:
MTU = 1292
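For reference, a sketch of how the missing grain could be set from the master, assuming salt's grains.setval execution module (which persists the value in /etc/salt/grains on the minion); not necessarily the exact method used here:
# set needs_wireguard=False on the two hosts that were missing the grain and
# refresh so that grain-based targeting picks up the change
for m in backup-vm.qe.nue2.suse.org baremetal-support.qe.nue2.suse.org; do
    sudo salt "$m" grains.setval needs_wireguard False
    sudo salt "$m" saltutil.refresh_grains
done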
Updated by okurz 10 days ago · Edited
- Status changed from Resolved to In Progress
As discussed with nicksinger, reopening. There are multiple pipelines either stalling and running into the 2h timeout or showing weird salt issues: https://gitlab.suse.de/openqa/salt-states-openqa/-/pipelines
I tried with sudo count-fail-ratio salt \* saltutil.sync_grains and found no failure: count-fail-ratio: Run: 20. Fails: 0. Fail ratio 0±0%. No fails, computed failure probability < 15.00%. Probably we need to apply the complete high state.
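The count-fail-ratio helper essentially repeats a command and reports the failure rate; a simplified stand-in (not the actual implementation) looks roughly like this:
# run the salt call N times and count how often it fails
runs=20 fails=0
for i in $(seq "$runs"); do
    sudo salt '*' saltutil.sync_grains > /dev/null || fails=$((fails + 1))
done
echo "Run: $runs. Fails: $fails. Fail ratio: $((100 * fails / runs))%"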
Updated by okurz 10 days ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1413 (merged). Next step: Contact IT regarding expected MTU size. I will report a separate ticket about better MTU based monitoring.
Updated by okurz 10 days ago
- Copied to action #179302: Better monitoring for correct MTU size limits added
Updated by okurz 10 days ago
- Description updated (diff)
I tried to manually ensure that the MTU of 1272 is deployed on all relevant machines and partially triggered reboots. Over salt the machines are reachable, but over ssh only with a delay of multiple minutes, possibly because the lower MTU is not effective yet. I also could not find a stable way to start/restart the wireguard tunnels and am getting the error "RTNETLINK answers: No such device". Could it be that https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1404/ removed some devices or something?
Anyway, to not constantly fail salt pipelines I have for now removed all NUE2 wg hosts from salt and added a silence. Please mind the corresponding rollback actions.
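For the record, removing the hosts mirrors the rollback loop in the description, just with salt-key -d instead of -a (sketch):
# remove the NUE2 wireguard hosts from salt (inverse of the rollback action above)
for i in diesel.qe.nue2.suse.org petrol.qe.nue2.suse.org monitor.qe.nue2.suse.org sapworker1.qe.nue2.suse.org openqa-piworker.qe.nue2.suse.org ; do sudo salt-key -y -d $i; done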
Updated by openqa_review 10 days ago
- Due date set to 2025-04-04
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 7 days ago
- Priority changed from High to Urgent
From mmoese in https://suse.slack.com/archives/C02CANHLANP/p1742810337597879 #eng-testing
@qa-tools I see a lot of jobs being abandoned by sapworker1 (like https://openqa.suse.de/tests/17128750 )
I am convinced this is related to this ticket, hence raising to "Urgent".
Updated by nicksinger 6 days ago
I changed the MTU for diesel, monitor, openqa-piworker, petrol and sapworker1 in the config file given by echo /etc/sysconfig/network/ifcfg-`ip -j r s default | jq -r ".[0].dev"` and restarted the network service on them. I determined the value again by using ping -6 -I em1 -M do -s $((1360-8-40)) openqa.suse.de (on sapworker1), but using the "outer interface" this time. Removing the "PreUp="- and "PostUp="-statements in /etc/wireguard/prg2wg.conf makes the tunnel restart much faster. I also removed the "MTU="-statement again as the PMTU discovery (or auto-calculation, I am not sure which) should work properly again. After restarting the wg interface, this is indeed the case:
sapworker1:/etc/wireguard # ip link show dev prg2wg
154: prg2wg: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1280 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/none
sapworker1:/etc/wireguard # ping -6 -I prg2wg -M do -s $((1280-8-40)) openqa.suse.de
PING openqa.suse.de (2a07:de40:b203:12:0:ff:fe4f:7c2b) from 2a07:de40:b21b:3:10:144:169:4 prg2wg: 1232 data bytes
1240 bytes from openqa.suse.de (2a07:de40:b203:12:0:ff:fe4f:7c2b): icmp_seq=1 ttl=62 time=5.94 ms
1240 bytes from openqa.suse.de (2a07:de40:b203:12:0:ff:fe4f:7c2b): icmp_seq=2 ttl=62 time=5.85 ms
1240 bytes from openqa.suse.de (2a07:de40:b203:12:0:ff:fe4f:7c2b): icmp_seq=3 ttl=62 time=5.75 ms
^C
--- openqa.suse.de ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 5.746/5.846/5.940/0.079 ms
I've accepted the salt-key for monitor again on OSD and am now testing with this machine whether everything is stable again.
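Both changes described above could be scripted roughly as follows; this is a sketch only: the MTU value of 1360 is derived from the ping probe in this comment and the wg-quick unit name is an assumption, so the exact steps taken may have differed:
# pin the MTU of the outer interface via its ifcfg file and re-apply the network config
dev=$(ip -j r s default | jq -r '.[0].dev')
echo "MTU='1360'" | sudo tee -a /etc/sysconfig/network/ifcfg-$dev
sudo systemctl restart network
# drop the MTU/PreUp/PostUp overrides from the tunnel config so wg-quick
# auto-calculates the MTU, then restart the tunnel and check the result
sudo cp /etc/wireguard/prg2wg.conf /etc/wireguard/prg2wg.conf.bak
sudo sed -i -E '/^(MTU|PreUp|PostUp)[[:space:]]*=/d' /etc/wireguard/prg2wg.conf
sudo systemctl restart wg-quick@prg2wg
ip link show dev prg2wg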
Updated by nicksinger 6 days ago
Seems to work. Before moving on to adding this to our salt, I am now verifying with runs=100 count-fail-ratio salt 'monitor.qe.nue2.suse.org' state.apply (from openqa:/srv/salt/wireguard), running in tmux session 0 on OSD.
Updated by nicksinger 6 days ago
finished with:
## count-fail-ratio: Run: 100. Fails: 7. Fail ratio 7.00±5.00%
## mean runtime: 100813±26821.26 ms
The fails are all instances of #176949 so going forward with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1416 should be fine. I didn't add them to salt yet because the tunnel MTU in salt would break them anyway.
Updated by okurz 5 days ago
Failures of the day:
- https://gitlab.suse.de/openqa/osd-deployment/-/jobs/4041080#L6111 showing sapworker1 stuck with "System management is locked by the application with pid 42111 (zypper)". As I can't even log in normally over ssh (it takes long and only "reacts" to ctrl-c, showing a bash prompt without a proper PS1; ssh -4 also takes long but eventually reacts) I assume this is related to this ticket
- https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/4039441 running into the 2h timeout
Updated by nicksinger 4 days ago
- Priority changed from Urgent to High
So for now most of the machines are online and in salt again, without wireguard enabled but with proper MTU settings. It turned out that 1280 is the minimum MTU required for IPv6 to work, which is why https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1420 broke a lot overnight. I ran an experiment on sapworker1 with NFS disabled: with "openqa_share_nfs: False" in /etc/salt/grains we can enable sapworker1 in salt again. https://openqa.suse.de/tests/17138601 is the first test in this setup. https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/4044163#L115 is failing again due to XML errors but at least passed sapworker1 quite fast.
The same applies to diesel. I will apply the same on petrol tomorrow after reviewing some openQA jobs running on these machines.
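A sketch of the per-host grain change mentioned above (the grain name is taken from this comment; the exact mechanism used may differ):
# disable the NFS share handling for this worker via a local grain and make
# sure the minion picks it up
echo 'openqa_share_nfs: False' | sudo tee -a /etc/salt/grains
sudo salt-call saltutil.refresh_grains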