action #107152
closed
[osd] failing systemd services on "grenache-1": "openqa-reload-worker-auto-restart@10, openqa-reload-worker-auto-restart@21, openqa-reload-worker-auto-restart@22, openqa-reload-worker-auto-restart@23, openqa-reload-worker-auto-restart@25, …" size:M
Added by okurz about 2 years ago.
Updated about 2 years ago.
Description
Observation
from https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1
failing services
openqa-reload-worker-auto-restart@10, openqa-reload-worker-auto-restart@21, openqa-reload-worker-auto-restart@22, openqa-reload-worker-auto-restart@23, openqa-reload-worker-auto-restart@25, openqa-reload-worker-auto-restart@27
Suggestions
- Find out the failure reasons (worker by worker)
- systemctl reset-failed can reset it once, but we should also extend our process descriptions on the wiki or extend salt recipes or something
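A minimal sketch of the "worker by worker" check suggested above, assuming it runs on the affected worker host with systemd available. The function names reload_unit_name and show_failure_reasons are hypothetical helpers, not part of openQA or the salt recipes; only the unit name pattern comes from this ticket.

```shell
#!/bin/sh
# Sketch only: helper names are made up for illustration.

# Build the reload service unit name for one worker instance number.
reload_unit_name() {
    printf 'openqa-reload-worker-auto-restart@%s.service\n' "$1"
}

# Print the status and the last 20 journal lines for each given instance.
# Needs systemd, so run it on the worker host itself.
show_failure_reasons() {
    for instance in "$@"; do
        unit=$(reload_unit_name "$instance")
        echo "=== $unit ==="
        systemctl status --no-pager "$unit"
        journalctl --no-pager -n 20 -u "$unit"
    done
}
```

It could be called as show_failure_reasons 10 21 22 23 25 27. Once the reasons are understood, sudo systemctl reset-failed clears the failed state.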
- Subject changed from [osd] failing systemd services on "grenache-1": "openqa-reload-worker-auto-restart@10, openqa-reload-worker-auto-restart@21, openqa-reload-worker-auto-restart@22, openqa-reload-worker-auto-restart@23, openqa-reload-worker-auto-restart@25, …" to [osd] failing systemd services on "grenache-1": "openqa-reload-worker-auto-restart@10, openqa-reload-worker-auto-restart@21, openqa-reload-worker-auto-restart@22, openqa-reload-worker-auto-restart@23, openqa-reload-worker-auto-restart@25, …" size:M
- Description updated (diff)
- Status changed from New to Workable
grenache-1 is currently offline due to the server migration. (Matthias wrote "Grenache will be stopped now" at 10:56 AM.) Maybe I'll find something in the logs after it is back. Maybe these failures were also caused by some other work in the labs that happened before grenache-1 was stopped.
- Status changed from Workable to Feedback
It is back again. All of the workers failed because they had been masked while the corresponding reload/path units (e.g. openqa-reload-worker-auto-restart@10.path) had not been masked as well. systemd therefore still attempted to reload the masked worker units, which failed.
I checked all masked worker units via systemctl list-unit-files --state=masked and masked the corresponding reload units via sudo systemctl mask openqa-reload-worker-auto-restart@{10,21,22,23,25,27}.{service,path}. This should fix the issue. Of course we need to take that into account as well when unmasking the units again. Using sudo systemctl unmask openqa{,-reload}-worker-auto-restart@{10,21,22,23,25,27}.{service,path} for that should do the trick.
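To avoid typing the instance numbers by hand, the mapping from masked worker units to their reload units could be sketched as below. This assumes the worker units are named openqa-worker-auto-restart@N.service (inferred from the unmask command above) and that the unit name is the first column of the list-unit-files output. The function name reload_units_for_masked is hypothetical.

```shell
#!/bin/sh
# Sketch only: reads `systemctl list-unit-files --state=masked` output on
# stdin and prints the reload .service and .path unit for every masked
# worker instance. Lines for other units (and the header) are ignored.
reload_units_for_masked() {
    awk '$1 ~ /^openqa-worker-auto-restart@[0-9]+\.service$/ {
        n = $1
        sub(/^openqa-worker-auto-restart@/, "", n)  # strip the unit prefix
        sub(/\.service$/, "", n)                    # keep only the instance number
        printf "openqa-reload-worker-auto-restart@%s.service\n", n
        printf "openqa-reload-worker-auto-restart@%s.path\n", n
    }'
}
```

A possible use is systemctl list-unit-files --state=masked | reload_units_for_masked to review the resulting list before passing those names to sudo systemctl mask.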
- Status changed from Feedback to Resolved
The documentation changes have been merged and no services are failing anymore. I think that's enough for the alert handling.