action #107152

[osd] failing systemd services on "grenache-1": "openqa-reload-worker-auto-restart@10, openqa-reload-worker-auto-restart@21, openqa-reload-worker-auto-restart@22, openqa-reload-worker-auto-restart@23, openqa-reload-worker-auto-restart@25, …" size:M

Added by okurz about 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: -
Start date: 2022-02-18
Due date: -
% Done: 0%
Estimated time: -

Description

Observation

from https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1

failing services

openqa-reload-worker-auto-restart@10, openqa-reload-worker-auto-restart@21, openqa-reload-worker-auto-restart@22, openqa-reload-worker-auto-restart@23, openqa-reload-worker-auto-restart@25, openqa-reload-worker-auto-restart@27

Suggestions

  • Find out the failure reasons (worker by worker)
  • systemctl reset-failed can clear the failed state once, but we should also extend our process descriptions on the wiki or the salt recipes accordingly (see the sketch after this list)
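
A minimal sketch of how the failure reasons could be checked and the failed state cleared (assuming the unit names from the alert above; the instance numbers are just the ones currently listed):

    # show why each reload unit failed
    for i in 10 21 22 23 25 27; do
        systemctl status "openqa-reload-worker-auto-restart@$i.service" --no-pager
        journalctl -u "openqa-reload-worker-auto-restart@$i.service" --since -2d --no-pager
    done

    # clear the failed state once the cause is understood
    sudo systemctl reset-failed openqa-reload-worker-auto-restart@{10,21,22,23,25,27}.service
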
Actions #2

Updated by okurz about 2 years ago

  • Subject changed from [osd] failing systemd services on "grenache-1": "openqa-reload-worker-auto-restart@10, openqa-reload-worker-auto-restart@21, openqa-reload-worker-auto-restart@22, openqa-reload-worker-auto-restart@23, openqa-reload-worker-auto-restart@25, …" to [osd] failing systemd services on "grenache-1": "openqa-reload-worker-auto-restart@10, openqa-reload-worker-auto-restart@21, openqa-reload-worker-auto-restart@22, openqa-reload-worker-auto-restart@23, openqa-reload-worker-auto-restart@25, …" size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by mkittler about 2 years ago

  • Assignee set to mkittler
Actions #4

Updated by mkittler about 2 years ago

grenache-1 is actually offline at the moment due to the server migration. (Matthias wrote "Grenache will be stopped now" at 10:56 AM.) Maybe I'll find something in the logs once it is back. These failures might also have been caused by other work in the labs that happened before grenache-1 was stopped.

Actions #5

Updated by mkittler about 2 years ago

  • Status changed from Workable to Feedback

It is back again. All of the affected worker units had been masked, but the corresponding reload service/path units (e.g. openqa-reload-worker-auto-restart@10.service/openqa-reload-worker-auto-restart@10.path) were not masked as well, so they still attempted to reload the masked worker units, which failed.
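
A quick way to confirm this kind of mismatch (just a sketch; instance 10 stands in for any of the affected workers):

    # the worker unit itself is masked ...
    systemctl is-enabled openqa-worker-auto-restart@10.service        # prints "masked"
    # ... but the reload trigger units are not, so the path unit keeps firing
    systemctl is-enabled openqa-reload-worker-auto-restart@10.path
    systemctl status openqa-reload-worker-auto-restart@10.service --no-pager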

I checked all masked worker units via systemctl list-unit-files --state=masked and masked the corresponding reload units via sudo systemctl mask openqa-reload-worker-auto-restart@{10,21,22,23,25,27}.{service,path}. This should fix the issue. Of course we need to take that into account as well when unmasking the units again. Using sudo systemctl unmask openqa{,-reload}-worker-auto-restart@{10,21,22,23,25,27}.{service,path} for that should do the trick.
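
For future occurrences this could even be scripted; a hypothetical sketch (not part of our salt states) that derives the reload units to mask from whatever worker units are currently masked:

    # mask the reload service/path units for every masked worker instance
    for unit in $(systemctl list-unit-files --state=masked --no-legend 'openqa-worker-auto-restart@*.service' | awk '{print $1}'); do
        instance=${unit#openqa-worker-auto-restart@}
        instance=${instance%.service}
        sudo systemctl mask "openqa-reload-worker-auto-restart@${instance}".{service,path}
    done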

Actions #6

Updated by mkittler about 2 years ago

I've improved the documentation to clarify the steps for masking worker services in our setup: https://github.com/os-autoinst/openQA/pull/4519, https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/655

That's all I'd do for the sake of this issue. (Simplifying the architecture by implementing a file system watch within the worker itself and not relying on two additional systemd units is likely out of scope here.)

Actions #7

Updated by mkittler about 2 years ago

  • Status changed from Feedback to Resolved

The documentation changes have been merged and no services are failing anymore. I think that's enough for the alert handling.
