action #107152: [osd] failing systemd services on "grenache-1": "openqa-reload-worker-auto-restart@10, openqa-reload-worker-auto-restart@21, openqa-reload-worker-auto-restart@22, openqa-reload-worker-auto-restart@23, openqa-reload-worker-auto-restart@25, …" size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

action #107152

closed

[osd] failing systemd services on "grenache-1": "openqa-reload-worker-auto-restart@10, openqa-reload-worker-auto-restart@21, openqa-reload-worker-auto-restart@22, openqa-reload-worker-auto-restart@23, openqa-reload-worker-auto-restart@25, …" size:M

Added by okurz over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Urgent
Assignee: mkittler
Category:
Target version: openQA Project (public) - Ready
Start date: 2022-02-18
Due date:
% Done:
Estimated time:

Description

Observation

from https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1

failing services

openqa-reload-worker-auto-restart@10, openqa-reload-worker-auto-restart@21, openqa-reload-worker-auto-restart@22, openqa-reload-worker-auto-restart@23, openqa-reload-worker-auto-restart@25, openqa-reload-worker-auto-restart@27

Suggestions

Find out the failure reasons (worker by worker)
systemctl reset-failed can clear the failed state once, but we should also extend our process descriptions on the wiki or extend the salt recipes

Updated by okurz over 3 years ago

Subject changed from [osd] failing systemd services on "grenache-1": "openqa-reload-worker-auto-restart@10, openqa-reload-worker-auto-restart@21, openqa-reload-worker-auto-restart@22, openqa-reload-worker-auto-restart@23, openqa-reload-worker-auto-restart@25, …" to [osd] failing systemd services on "grenache-1": "openqa-reload-worker-auto-restart@10, openqa-reload-worker-auto-restart@21, openqa-reload-worker-auto-restart@22, openqa-reload-worker-auto-restart@23, openqa-reload-worker-auto-restart@25, …" size:M
Description updated (diff)
Status changed from New to Workable

Updated by mkittler over 3 years ago

Assignee set to mkittler

Updated by mkittler over 3 years ago

grenache-1 is currently offline due to the server migration. (Matthias wrote "Grenache will be stopped now" at 10:56 AM.) Maybe I'll find something in the logs after it is back. These failures may also have been caused by other work in the labs that happened before grenache-1 was stopped.

Updated by mkittler over 3 years ago

Status changed from Workable to Feedback

It is back again. All of the reload units failed because the corresponding worker units had been masked but the reload service/path units (e.g. openqa-reload-worker-auto-restart@10.service/openqa-reload-worker-auto-restart@10.path) had not been masked as well. systemd therefore still attempted to reload the masked worker units, which failed.
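
A minimal sketch of how this mismatch could be detected per worker slot; it is not something that was run for this ticket. It assumes that systemctl is-enabled prints "masked" for a masked unit. The helper names unit_state and check_slot are hypothetical.

```shell
# Hypothetical wrapper around systemctl; prints the enablement state of a unit.
# "|| true" is needed because is-enabled exits non-zero for masked units.
unit_state() {
    systemctl is-enabled "$1" 2>/dev/null || true
}

# Report reload units that are not masked although the worker unit is masked.
check_slot() {
    slot=$1
    worker_state=$(unit_state "openqa-worker-auto-restart@$slot.service")
    for suffix in service path; do
        reload_unit="openqa-reload-worker-auto-restart@$slot.$suffix"
        reload_state=$(unit_state "$reload_unit")
        if [ "$worker_state" = masked ] && [ "$reload_state" != masked ]; then
            echo "inconsistent: worker slot $slot is masked but $reload_unit is $reload_state"
        fi
    done
}

# Example call for the slots named in this ticket:
# for slot in 10 21 22 23 25 27; do check_slot "$slot"; done
```

Any line printed by check_slot would point at a reload unit that still needs to be masked.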

I checked all masked worker units via systemctl list-unit-files --state=masked and masked the corresponding reload units via sudo systemctl mask openqa-reload-worker-auto-restart@{10,21,22,23,25,27}.{service,path}. This should fix the issue. Of course we need to take that into account as well when unmasking the units again. Using sudo systemctl unmask openqa{,-reload}-worker-auto-restart@{10,21,22,23,25,27}.{service,path} for that should do the trick.
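
The lookup of masked worker units and the derivation of the matching reload units could be combined roughly as follows. This is only a sketch: reload_units_for_masked_workers is a hypothetical helper, and it assumes that each line of the systemctl list-unit-files output starts with the unit name.

```shell
# Hypothetical helper: reads "systemctl list-unit-files --state=masked" output
# on stdin and prints the reload units belonging to each masked worker unit.
reload_units_for_masked_workers() {
    # Keep only the slot number of lines like "openqa-worker-auto-restart@10.service masked ..."
    sed -n 's/^openqa-worker-auto-restart@\([0-9][0-9]*\)\.service.*/\1/p' |
    while read -r slot; do
        printf 'openqa-reload-worker-auto-restart@%s.service\n' "$slot"
        printf 'openqa-reload-worker-auto-restart@%s.path\n' "$slot"
    done
}

# Possible usage (prints unit names only, does not change anything):
# systemctl list-unit-files --state=masked --no-legend | reload_units_for_masked_workers
```

The printed unit names could then be passed to sudo systemctl mask, for example via xargs. Printing first and masking in a separate step keeps the helper side-effect free and makes it easy to review the list.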

Updated by mkittler over 3 years ago

I've improved the documentation to clarify the steps for masking worker services in our setup: https://github.com/os-autoinst/openQA/pull/4519, https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/655

That's all I'd do for the sake of this issue. (Simplifying the architecture by implementing a file system watch within the worker itself and not relying on two additional systemd units is likely out of scope here.)

Updated by mkittler over 3 years ago

Status changed from Feedback to Resolved

The documentation changes have been merged and no services are failing anymore. I think that's enough for the alert handling.
