action #175629

open

coordination #161414: [epic] Improved salt based infrastructure management

diesel+petrol (possibly all ppc64le OPAL machines) often run into salt error "Not connected" or "No response" due to wireguard services failing to start on boot size:S

Added by okurz 15 days ago. Updated 10 days ago.

Status: Workable
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date: 2025-01-16
Due date:
% Done: 0%
Estimated time:

Description

Observation

For example, in https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/3666125#L433:

diesel.qe.nue2.suse.org:
    Minion did not return. [Not connected]
petrol.qe.nue2.suse.org:
    Minion did not return. [No response]

Acceptance criteria

  • AC1: All OSD salt controlled hosts using wireguard consistently bring up wireguard after every boot

Suggestions

  • Identify and fix the problem, maybe a systemd service dependency cycle?
  • Consider informing SUSE-IT upstream about the problem
  • How about a systemd service override that waits for a working network before bringing up wireguard, so that we do not depend on systemd service requirements on "network-online", which is never reliable? See https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/ in particular "What does this mean for me, a Developer?". Consider an additional "ExecStart=…" and repeat the existing ExecStart from systemctl cat wg-quick@.service, with the "…" being some call that waits for a working network before continuing
  • Use https://github.com/os-autoinst/scripts/blob/master/reboot-stability-check to ensure stability over reboots of hosts using wireguard, e.g. diesel, petrol, sapworker1
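The "wait for proper network" call suggested above could look roughly like the following. This is only a sketch, under the assumption that "proper network" means the wireguard endpoint name resolves; the function name wait_for_resolution and its parameters are hypothetical and not part of wg-quick or salt. The hostname and port in the usage note are the ones from the journal output in this ticket.

```python
import socket
import time


def wait_for_resolution(host, port, attempts=30, delay=2.0,
                        resolve=socket.getaddrinfo, sleep=time.sleep):
    """Return True as soon as `host` resolves, False if every attempt fails.

    `resolve` and `sleep` are injectable so the loop can be exercised
    without a real network (hypothetical helper, not an existing tool).
    """
    for attempt in range(1, attempts + 1):
        try:
            resolve(host, port)
            return True
        except socket.gaierror:
            # Name resolution is not available yet; wait and retry,
            # but do not sleep after the final attempt.
            if attempt < attempts:
                sleep(delay)
    return False
```

A wrapper script could call wait_for_resolution("wggw.prg2.suse.org", 52823) and exit non-zero on False, so that the override only continues to the original ExecStart once the name resolves.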

Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure (public) - action #175686: OSD webUI ended up with "502 Bad Gateway" from nginx on 2025-01-17, needed manual restart of openqa-webui (Resolved, okurz, 2025-01-17)
Copied from openQA Infrastructure (public) - action #175407: salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size:S (Resolved, okurz)
Copied to openQA Infrastructure (public) - action #175740: [alert] deploy pipeline for salt-states-openqa failed, multiple host run into salt error "Not connected" or "No response" (Resolved, okurz, 2025-01-16)

Actions #1

Updated by okurz 15 days ago

  • Copied from action #175407: salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size:S added
Actions #2

Updated by okurz 14 days ago

  • Related to action #175686: OSD webUI ended up with "502 Bad Gateway" from nginx on 2025-01-17, needed manual restart of openqa-webui added
Actions #3

Updated by okurz 14 days ago

  • Status changed from New to In Progress
  • Assignee set to okurz

I wanted to check the stability of applying salt states with "count-fail-ratio" but encountered a problem in retry, which I fixed in https://github.com/okurz/retry/pull/8 and submitted to Factory with https://build.opensuse.org/requests/1238226 (accepted). With osc mr -m "Update" openSUSE:Factory retry openSUSE:Backports:SLE-15-SP5:Update I also submitted an update to 15.5, which should also be inherited by 15.6 -> https://build.opensuse.org/requests/1238439
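The idea behind such a stability check can be sketched as follows. This is not the actual "count-fail-ratio" script, only an illustration of the measurement, assuming that a non-zero exit status counts as a failure; the function name and the injectable run parameter are hypothetical.

```python
import subprocess


def count_fail_ratio(command, runs=20, run=subprocess.run):
    """Run `command` `runs` times and return (failures, failure ratio).

    A run counts as failed when the command exits with a non-zero status.
    `run` is injectable so the counting can be exercised without
    executing anything (illustrative sketch, not the real script).
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    failures = 0
    for _ in range(runs):
        result = run(command, stdout=subprocess.DEVNULL,
                     stderr=subprocess.DEVNULL)
        if result.returncode != 0:
            failures += 1
    return failures, failures / runs
```

Any command whose exit status signals success can be passed as a list of arguments, for example the command that applies the salt states.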

Right now on petrol I saw that multiple services failed:

● auto-update.service                  loaded failed failed Automatically patch system packages.
● openqa-worker-auto-restart@8.service loaded failed failed openQA Worker #8
● openqa-worker-auto-restart@9.service loaded failed failed openQA Worker #9
● prg2wg-restart.service               loaded failed failed Restart WireGuard service
● wg-quick@prg2wg.service              loaded failed failed WireGuard via wg-quick(8) for prg2wg

The root problem is probably wg-quick@prg2wg.service. journalctl -u wg-quick@prg2wg.service shows:

-- Boot 963375f1b4f04c308ae2ca3b33879654 --
Jan 15 10:08:41 petrol systemd[1]: Starting WireGuard via wg-quick(8) for prg2wg...
Jan 15 10:08:42 petrol wg-quick[3228]: [#] systemctl enable wg-quick@prg2wg
Jan 15 10:08:42 petrol wg-quick[3228]: [#] ip link add prg2wg type wireguard
Jan 15 10:08:42 petrol wg-quick[3228]: [#] wg setconf prg2wg /dev/fd/63
Jan 15 10:08:43 petrol wg-quick[3429]: Name or service not known: `wggw.prg2.suse.org:52823'
Jan 15 10:08:43 petrol wg-quick[3429]: Configuration parsing error
Jan 15 10:08:43 petrol wg-quick[3228]: [#] ip link delete dev prg2wg
Jan 15 10:08:43 petrol systemd[1]: wg-quick@prg2wg.service: Main process exited, code=exited, status=1/FAILURE
Jan 15 10:08:43 petrol systemd[1]: wg-quick@prg2wg.service: Failed with result 'exit-code'.
Jan 15 10:08:43 petrol systemd[1]: Failed to start WireGuard via wg-quick(8) for prg2wg.
Actions #4

Updated by okurz 14 days ago

  • Assignee changed from okurz to nicksinger
Actions #6

Updated by nicksinger 14 days ago

  • Status changed from In Progress to Blocked

We looked at the service and found that the restart-service apparently only tries restarting once. Early during boot there is no DNS available, and the service never retries after that point. I've added a mechanism to salt, similar to the one we already have for NFS, that resolves the wg-endpoint at each salt run and writes the result to /etc/hosts. This way the system can resolve the endpoint in early boot but should still receive updates on DNS changes.

I already tested this change on diesel and wg was up after a reboot. I will open an MR once gitlab is back online.
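The resolve-and-pin approach described above could be sketched like this. It is not the actual salt state, only an illustration of the mechanism; the function names pin_host and resolve_ipv4 are hypothetical, and the sketch handles IPv4 only and drops a whole line when the hostname shares it with other aliases.

```python
import socket


def pin_host(hosts_text, hostname, address):
    """Return `hosts_text` with exactly one entry mapping hostname to address.

    Simplification (assumption): a line listing `hostname` is dropped as a
    whole, including any other aliases or trailing comment on that line.
    """
    kept = []
    for line in hosts_text.splitlines():
        # Ignore trailing comments when looking at the fields of an entry.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and hostname in fields[1:]:
            continue  # stale entry for this hostname, replaced below
        kept.append(line)
    kept.append(f"{address} {hostname}")
    return "\n".join(kept) + "\n"


def resolve_ipv4(hostname):
    """Return the first IPv4 address for hostname (raises socket.gaierror)."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return infos[0][4][0]
```

Run at every salt run, pin_host(current_hosts, name, resolve_ipv4(name)) is idempotent when the address is unchanged and replaces the entry when DNS changes, which matches the behaviour described in the comment.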

Actions #7

Updated by nicksinger 13 days ago

  • Status changed from Blocked to Feedback
Actions #8

Updated by jbaier_cz 13 days ago

  • Copied to action #175740: [alert] deploy pipeline for salt-states-openqa failed, multiple host run into salt error "Not connected" or "No response" added
Actions #9

Updated by nicksinger 10 days ago · Edited

After struggling with several unrelated changes in the pipeline, I had to add another fix here as well: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1349 - once this is merged and the deployment works again, I will assign to @okurz so you can run the benchmarks mentioned in https://progress.opensuse.org/issues/175629#note-3

Actions #10

Updated by okurz 10 days ago

  • Status changed from Feedback to New

From today: sapworker1 has the entry in /etc/hosts but apparently still tries to check the network before the network is available:

sapworker1:~ # journalctl -b -u wg-quick@prg2wg.service
Jan 19 03:35:08 sapworker1 systemd[1]: Starting WireGuard via wg-quick(8) for prg2wg...
Jan 19 03:35:08 sapworker1 wg-quick[7377]: [#] systemctl enable wg-quick@prg2wg
Jan 19 03:35:12 sapworker1 wg-quick[7377]: [#] ip link add prg2wg type wireguard
Jan 19 03:35:12 sapworker1 wg-quick[7377]: [#] wg setconf prg2wg /dev/fd/63
Jan 19 03:35:12 sapworker1 wg-quick[7377]: [#] ip -4 address add 10.144.169.4 dev prg2wg
Jan 19 03:35:12 sapworker1 wg-quick[7377]: [#] ip -6 address add 2a07:de40:b21b:3:10:144:169:4/128 dev prg2wg
Jan 19 03:35:12 sapworker1 wg-quick[10079]: RTNETLINK answers: Network is unreachable
Jan 19 03:35:12 sapworker1 wg-quick[7377]: [#] ip link set mtu 1420 up dev prg2wg
Jan 19 03:35:12 sapworker1 wg-quick[7377]: [#] ip -6 route add 2a07:de40:b21a:1:10:144:174:14/128 dev prg2wg
Jan 19 03:35:12 sapworker1 wg-quick[7377]: [#] ip -6 route add 2a07:de40:b2a0:1::/64 dev prg2wg
Jan 19 03:35:12 sapworker1 wg-quick[7377]: [#] ip -6 route add 2a07:de40:b281:80::/64 dev prg2wg
Jan 19 03:35:12 sapworker1 wg-quick[7377]: [#] ip -6 route add 2a07:de40:b280:84::/64 dev prg2wg
Jan 19 03:35:12 sapworker1 wg-quick[7377]: [#] ip -6 route add 2a07:de40:b280:82::/64 dev prg2wg
Jan 19 03:35:12 sapworker1 wg-quick[7377]: [#] ip -6 route add 2a07:de40:b280:53::/64 dev prg2wg
Jan 19 03:35:12 sapworker1 wg-quick[7377]: [#] ip -6 route add 2a07:de40:b280:21::/64 dev prg2wg
Jan 19 03:35:12 sapworker1 wg-quick[7377]: [#] ip -6 route add 2a07:de40:b280:20::/64 dev prg2wg
Jan 19 03:35:12 sapworker1 wg-quick[7377]: [#] ip -6 route add 2a07:de40:b280:19::/64 dev prg2wg
Jan 19 03:35:12 sapworker1 wg-quick[7377]: [#] ip -6 route add 2a07:de40:b280:14::/64 dev prg2wg
Jan 19 03:35:12 sapworker1 wg-quick[7377]: [#] ip -6 route add 2a07:de40:b224:3::/64 dev prg2wg
Jan 19 03:35:12 sapworker1 wg-quick[7377]: [#] ip -6 route add 2a07:de40:b224:2::/64 dev prg2wg
Jan 19 03:35:12 sapworker1 wg-quick[7377]: [#] ip -6 route add 2a07:de40:b224:1::/64 dev prg2wg
Jan 19 03:35:12 sapworker1 wg-quick[7377]: [#] ip -6 route add 2a07:de40:b205:7::/64 dev prg2wg
Jan 19 03:35:12 sapworker1 wg-quick[7377]: [#] ip -6 route add 2a07:de40:b205:1c::/64 dev prg2wg
Jan 19 03:35:12 sapworker1 wg-quick[7377]: [#] ip -6 route add 2a07:de40:b205:19::/64 dev prg2wg
Jan 19 03:35:12 sapworker1 wg-quick[7377]: [#] ip -6 route add 2a07:de40:b205:15::/64 dev prg2wg
Jan 19 03:35:12 sapworker1 wg-quick[7377]: [#] ip -6 route add 2a07:de40:b205:12::/64 dev prg2wg
Jan 19 03:35:12 sapworker1 wg-quick[7377]: [#] ip -6 route add 2a07:de40:b204:8::/64 dev prg2wg
Jan 19 03:35:12 sapworker1 wg-quick[7377]: [#] ip -6 route add 2a07:de40:b204:6::/64 dev prg2wg
Jan 19 03:35:12 sapworker1 wg-quick[7377]: [#] ip -6 route add 2a07:de40:b204:4::/64 dev prg2wg
Jan 19 03:35:12 sapworker1 wg-quick[7377]: [#] ip -6 route add 2a07:de40:b204:14::/64 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -6 route add 2a07:de40:b203:12::/64 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -6 route add 2a07:de40:b200:137::/64 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -6 route add 2a07:de40:b200:136::/64 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -6 route add 2a07:de40:b281:60::/59 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -6 route add 2a07:de40:b281:100::/59 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -4 route add 10.144.174.14/32 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -4 route add 10.151.20.32/27 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -4 route add 10.151.20.0/27 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -4 route add 10.150.16.0/27 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -4 route add 10.144.9.128/27 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -4 route add 10.144.53.64/27 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -4 route add 10.144.53.32/27 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -4 route add 10.144.39.128/27 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -4 route add 10.144.38.128/27 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -4 route add 10.144.37.192/27 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -4 route add 10.145.136.64/26 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -4 route add 10.145.136.0/26 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -4 route add 10.145.52.0/25 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -4 route add 10.145.48.0/25 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -4 route add 10.144.9.0/25 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -4 route add 10.151.53.0/24 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -4 route add 10.151.19.0/24 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -4 route add 10.151.14.0/24 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -4 route add 10.145.56.0/24 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -4 route add 10.145.54.0/24 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -4 route add 10.145.50.0/24 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -4 route add 10.145.10.0/24 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -4 route add 10.144.8.0/24 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -4 route add 10.150.64.0/20 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ip -4 route add 10.150.128.0/20 dev prg2wg
Jan 19 03:35:13 sapworker1 wg-quick[7377]: [#] ping -c 3 10.144.174.14
Jan 19 03:35:25 sapworker1 wg-quick[10435]: PING 10.144.174.14 (10.144.174.14) 56(84) bytes of data.
Jan 19 03:35:25 sapworker1 wg-quick[10435]: --- 10.144.174.14 ping statistics ---
Jan 19 03:35:25 sapworker1 wg-quick[10435]: 3 packets transmitted, 0 received, 100% packet loss, time 2040ms
Jan 19 03:35:25 sapworker1 wg-quick[7377]: [#] ip link delete dev prg2wg
Jan 19 03:35:25 sapworker1 systemd[1]: wg-quick@prg2wg.service: Main process exited, code=exited, status=1/FAILURE
Jan 19 03:35:25 sapworker1 systemd[1]: wg-quick@prg2wg.service: Failed with result 'exit-code'.
Jan 19 03:35:25 sapworker1 systemd[1]: Failed to start WireGuard via wg-quick(8) for prg2wg.

One thing that could explain it is a systemd dependency cycle, but I could not find one there.
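To tell the failure modes apart when checking many reboots, the journal output could be classified with a small helper like the one below. This is a sketch only: the three signatures are the messages visible in the journal excerpts in this ticket, while the labels and the function name are hypothetical.

```python
import re

# Signatures copied from the journal excerpts in this ticket;
# the labels on the left are made up for illustration.
FAILURE_PATTERNS = [
    ("dns", re.compile(r"Name or service not known")),
    ("unreachable", re.compile(r"Network is unreachable")),
    ("no-reply", re.compile(r"100% packet loss")),
]


def classify_wg_failures(journal_text):
    """Return the labels of all known failure signatures found in the text,
    in the order in which the patterns are listed above."""
    return [label for label, pattern in FAILURE_PATTERNS
            if pattern.search(journal_text)]
```

Fed with the output of journalctl -b -u wg-quick@prg2wg.service after each reboot, this separates the early DNS failure seen on petrol from the unreachable network and lost pings seen on sapworker1.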

Actions #11

Updated by okurz 10 days ago

  • Subject changed from diesel+petrol (possibly all ppc64le OPAL machines) often run into salt error "Not connected" or "No response" to diesel+petrol (possibly all ppc64le OPAL machines) often run into salt error "Not connected" or "No response" due to wireguard services failing to start on boot size:S
  • Description updated (diff)
  • Status changed from New to Workable