action #107152 (closed)
[osd] failing systemd services on "grenache-1": "openqa-reload-worker-auto-restart@10, openqa-reload-worker-auto-restart@21, openqa-reload-worker-auto-restart@22, openqa-reload-worker-auto-restart@23, openqa-reload-worker-auto-restart@25, …" size:M
Description
Observation
From https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1

Failing services:
openqa-reload-worker-auto-restart@10, openqa-reload-worker-auto-restart@21, openqa-reload-worker-auto-restart@22, openqa-reload-worker-auto-restart@23, openqa-reload-worker-auto-restart@25, openqa-reload-worker-auto-restart@27
Suggestions
- Find out the failure reasons (worker by worker); see the sketch after this list
- systemctl reset-failed can reset the failed state once, but we should also extend our process descriptions on the wiki or extend the salt recipes accordingly
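A minimal sketch of what the worker-by-worker inspection could look like on grenache-1 (instance numbers taken from the alert above; the exact journal output will of course differ):

    # overview of everything currently in failed state
    systemctl --failed

    # check each affected reload unit and its recent log individually
    for i in 10 21 22 23 25 27; do
        systemctl status "openqa-reload-worker-auto-restart@$i.service"
        journalctl -u "openqa-reload-worker-auto-restart@$i.service" --since yesterday --no-pager
    done

    # once the reason is understood, clear the failed state
    sudo systemctl reset-failed 'openqa-reload-worker-auto-restart@*'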
Updated by okurz over 2 years ago
- Subject changed from [osd] failing systemd services on "grenache-1": "openqa-reload-worker-auto-restart@10, openqa-reload-worker-auto-restart@21, openqa-reload-worker-auto-restart@22, openqa-reload-worker-auto-restart@23, openqa-reload-worker-auto-restart@25, …" to [osd] failing systemd services on "grenache-1": "openqa-reload-worker-auto-restart@10, openqa-reload-worker-auto-restart@21, openqa-reload-worker-auto-restart@22, openqa-reload-worker-auto-restart@23, openqa-reload-worker-auto-restart@25, …" size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler over 2 years ago
grenache-1 is actually currently offline due to the server migration. (Matthias wrote "Grenache will be stopped now" at 10:56 AM.) Maybe I'll find something in the logs after it is back. These failures may also have been caused by some other work in the labs that happened before grenache-1 was stopped.
Updated by mkittler over 2 years ago
- Status changed from Workable to Feedback
It is back again. All of the workers failed because they have been masked while the corresponding reload/path units (e.g. openqa-reload-worker-auto-restart@10.path) were not masked as well, so reloading the masked unit was still attempted, which failed.
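The mismatch can be verified directly with systemctl; a sketch assuming instance 10 and the openqa-worker-auto-restart@.service naming used in the unmask command below:

    # the worker unit itself is masked (is-enabled prints "masked" and exits non-zero)
    systemctl is-enabled openqa-worker-auto-restart@10.service
    # ...while its reload/path companions were still enabled and kept firing
    systemctl is-enabled openqa-reload-worker-auto-restart@10.path
    systemctl is-enabled openqa-reload-worker-auto-restart@10.service

    # overview of all masked worker-related units on the host
    systemctl list-unit-files --state=masked 'openqa*worker-auto-restart@*'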
I checked all masked worker units via systemctl list-unit-files --state=masked and masked the corresponding reload units via sudo systemctl mask openqa-reload-worker-auto-restart@{10,21,22,23,25,27}.{service,path}. This should fix the issue. Of course we need to take that into account as well when unmasking the units again; using sudo systemctl unmask openqa{,-reload}-worker-auto-restart@{10,21,22,23,25,27}.{service,path} for that should do the trick.
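The {10,21,22,23,25,27}.{service,path} part is ordinary shell brace expansion, so a single invocation covers every affected instance. Previewing the expansion with echo before running the actual mask/unmask command is a cheap sanity check (a sketch, using the instance numbers from this ticket):

    # print the full command lines without executing them
    echo sudo systemctl mask openqa-reload-worker-auto-restart@{10,21,22,23,25,27}.{service,path}
    echo sudo systemctl unmask openqa{,-reload}-worker-auto-restart@{10,21,22,23,25,27}.{service,path}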
Updated by mkittler over 2 years ago
I've improved the documentation to clarify the steps for masking worker services in our setup: https://github.com/os-autoinst/openQA/pull/4519, https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/655
That's all I'd do for the sake of this issue. (Simplifying the architecture by implementing a file system watch within the worker itself and not relying on two additional systemd units is likely out of scope here.)
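For reference, the pieces of the current approach can be inspected on the host itself; systemctl cat prints the installed unit definitions, here using instance 10 as an example:

    # the path unit that watches for changes and the reload service it triggers
    systemctl cat openqa-reload-worker-auto-restart@10.path
    systemctl cat openqa-reload-worker-auto-restart@10.service
    # the worker unit that (presumably) gets reloaded by it
    systemctl cat openqa-worker-auto-restart@10.service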
Updated by mkittler over 2 years ago
- Status changed from Feedback to Resolved
The documentation changes have been merged and no services are failing anymore. I think that's enough for the alert handling.