action #175629

closed

coordination #161414: [epic] Improved salt based infrastructure management

diesel+petrol (possibly all ppc64le OPAL machines) often run into salt error "Not connected" or "No response" due to wireguard services failing to start on boot size:S

Added by okurz 22 days ago. Updated 7 days ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date: 2025-01-16
Due date:
% Done: 0%
Estimated time:

Description

Observation

For example

diesel.qe.nue2.suse.org:
    Minion did not return. [Not connected]
petrol.qe.nue2.suse.org:
    Minion did not return. [No response]

in https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/3666125#L433

Acceptance criteria

  • AC1: All OSD salt controlled hosts using wireguard consistently bring up wireguard after every boot

Suggestions

  • Identify and fix the problem; maybe a systemd service dependency cycle?
  • Consider informing SUSE-IT upstream about the problem
  • Consider a systemd service override that waits for a working network before bringing up wireguard, so that we do not depend on systemd service requirements on "network-online", which is never reliable; see https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/, in particular "What does this mean for me, a Developer?". For example, add an additional "ExecStart=…" and then repeat the existing ExecStart from systemctl cat wg-quick@.service, with the "…" being some call that waits for a working network before continuing
  • Use https://github.com/os-autoinst/scripts/blob/master/reboot-stability-check to ensure stability over reboots of hosts using wireguard, e.g. diesel, petrol, sapworker1
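The "call to wait for proper network" from the suggestions could look roughly like the following Python sketch. It is only an illustration under assumptions: it treats "proper network" as "a given hostname resolves", which may not be the actual failure mode on diesel and petrol, and the names wait_for_resolution and main as well as the default timeout and interval are hypothetical and not taken from the ticket or from salt-states-openqa.

```python
import socket
import time


def wait_for_resolution(host, timeout=60.0, interval=2.0,
                        resolve=socket.getaddrinfo,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll until `host` resolves or `timeout` seconds have passed.

    Returns True as soon as resolution succeeds, False on timeout.
    `resolve`, `sleep` and `clock` are injectable to allow testing
    without touching the real network (hypothetical design choice).
    """
    deadline = clock() + timeout
    while True:
        try:
            resolve(host, None)
            return True
        except OSError:
            # socket.gaierror is a subclass of OSError, so failed
            # lookups end up here; give up once the deadline passed.
            if clock() >= deadline:
                return False
            sleep(interval)


def main(argv):
    """Return a process exit code: 0 if the host resolved, 1 otherwise."""
    if len(argv) != 2:
        return 2  # usage error: expected exactly one hostname argument
    return 0 if wait_for_resolution(argv[1]) else 1
```

A small wrapper script calling sys.exit(main(sys.argv)) could then be used as the additional "ExecStart=…" line described above, so that wg-quick only runs once the lookup succeeds. Whether name resolution is the right readiness check would still need to be confirmed on the affected hosts, for example with the reboot-stability-check script.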

Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure (public) - action #175686: OSD webUI ended up with "502 Bad Gateway" from nginx on 2025-01-17, needed manual restart of openqa-webui | Resolved | okurz | 2025-01-17

Copied from openQA Infrastructure (public) - action #175407: salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size:S | Resolved | okurz

Copied to openQA Infrastructure (public) - action #175740: [alert] deploy pipeline for salt-states-openqa failed, multiple host run into salt error "Not connected" or "No response" | Resolved | okurz | 2025-01-16
