action #107152 (closed)
[osd] failing systemd services on "grenache-1": "openqa-reload-worker-auto-restart@10, openqa-reload-worker-auto-restart@21, openqa-reload-worker-auto-restart@22, openqa-reload-worker-auto-restart@23, openqa-reload-worker-auto-restart@25, …" size:M
Description
Observation
From https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1

Failing services:
openqa-reload-worker-auto-restart@10, openqa-reload-worker-auto-restart@21, openqa-reload-worker-auto-restart@22, openqa-reload-worker-auto-restart@23, openqa-reload-worker-auto-restart@25, openqa-reload-worker-auto-restart@27
Suggestions
- Find out the failure reasons (worker by worker); see the sketch after this list
- systemctl reset-failed can reset the failed state once, but we should also extend our process descriptions on the wiki or extend the salt recipes accordingly
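A minimal sketch of what the worker-by-worker inspection could look like on grenache-1 (instance numbers taken from the alert above; the exact journal output will of course differ):

    # overview of everything currently in failed state
    systemctl --failed

    # check each affected reload unit and its recent log individually
    for i in 10 21 22 23 25 27; do
        systemctl status "openqa-reload-worker-auto-restart@$i.service"
        journalctl -u "openqa-reload-worker-auto-restart@$i.service" --since yesterday --no-pager
    done

    # once the reason is understood, clear the failed state
    sudo systemctl reset-failed 'openqa-reload-worker-auto-restart@*'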
Updated by okurz over 2 years ago
- Subject changed from [osd] failing systemd services on "grenache-1": "openqa-reload-worker-auto-restart@10, openqa-reload-worker-auto-restart@21, openqa-reload-worker-auto-restart@22, openqa-reload-worker-auto-restart@23, openqa-reload-worker-auto-restart@25, …" to [osd] failing systemd services on "grenache-1": "openqa-reload-worker-auto-restart@10, openqa-reload-worker-auto-restart@21, openqa-reload-worker-auto-restart@22, openqa-reload-worker-auto-restart@23, openqa-reload-worker-auto-restart@25, …" size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler over 2 years ago
grenache-1 is actually currently offline due to the server migration. (Matthias wrote "Grenache will be stopped now" at 10:56 AM.) Maybe I'll find something in the logs after it is back. These failures may also have been caused by some other work in the labs that happened before grenache-1 was stopped.
Updated by mkittler over 2 years ago
- Status changed from Workable to Feedback
It is back again. All of the workers failed because they have been masked while the corresponding reload/path units (e.g. openqa-reload-worker-auto-restart@10.path) were not masked as well, so reloading the masked unit was still attempted, which failed.
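The mismatch can be verified directly with systemctl; a sketch assuming instance 10 and the openqa-worker-auto-restart@.service naming used in the unmask command below:

    # the worker unit itself is masked (is-enabled prints "masked" and exits non-zero)
    systemctl is-enabled openqa-worker-auto-restart@10.service
    # ...while its reload/path companions were still enabled and kept firing
    systemctl is-enabled openqa-reload-worker-auto-restart@10.path
    systemctl is-enabled openqa-reload-worker-auto-restart@10.service

    # overview of all masked worker-related units on the host
    systemctl list-unit-files --state=masked 'openqa*worker-auto-restart@*'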
I checked all masked worker units via systemctl list-unit-files --state=masked and masked the corresponding reload units via sudo systemctl mask openqa-reload-worker-auto-restart@{10,21,22,23,25,27}.{service,path}. This should fix the issue. Of course we need to take that into account as well when unmasking the units again; using sudo systemctl unmask openqa{,-reload}-worker-auto-restart@{10,21,22,23,25,27}.{service,path} for that should do the trick.
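The {10,21,22,23,25,27}.{service,path} part is ordinary shell brace expansion, so a single invocation covers every affected instance. Previewing the expansion with echo before running the actual mask/unmask command is a cheap sanity check (a sketch, using the instance numbers from this ticket):

    # print the full command lines without executing them
    echo sudo systemctl mask openqa-reload-worker-auto-restart@{10,21,22,23,25,27}.{service,path}
    echo sudo systemctl unmask openqa{,-reload}-worker-auto-restart@{10,21,22,23,25,27}.{service,path}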
Updated by mkittler over 2 years ago
I've improved the documentation to clarify the steps for masking worker services in our setup: https://github.com/os-autoinst/openQA/pull/4519, https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/655
That's all I'd do for the sake of this issue. (Simplifying the architecture by implementing a file system watch within the worker itself and not relying on two additional systemd units is likely out of scope here.)
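For reference, the pieces of the current approach can be inspected on the host itself; systemctl cat prints the installed unit definitions, here using instance 10 as an example:

    # the path unit that watches for changes and the reload service it triggers
    systemctl cat openqa-reload-worker-auto-restart@10.path
    systemctl cat openqa-reload-worker-auto-restart@10.service
    # the worker unit that (presumably) gets reloaded by it
    systemctl cat openqa-worker-auto-restart@10.service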
Updated by mkittler over 2 years ago
- Status changed from Feedback to Resolved
The documentation changes have been merged and no services are failing anymore. I think that's enough for the alert handling.