action #73333

closed

Failed systemd services alert (workers) flaky

Added by nicksinger over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2020-10-14
Due date: 2020-10-22
% Done: 0%
Estimated time:

Description

Observation

In the last 12 hours we received quite a few alerts for failing systemd services on a worker host. Looking at https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&panelId=6&fullscreen&edit&tab=alert&from=1602604933324&to=1602655964690 it seems that one service is repeatedly failing and recovering. The alert reported values for systemd_failed.sum between 1.2 and 0.167, which I find confusing; these fractional values are a result of how we sample the data.
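To illustrate how a count of failed services can end up fractional, here is a minimal sketch. It assumes the alert evaluates the mean of the raw integer counts over a time window; the actual Grafana query is not shown in this ticket, and the sample series below are hypothetical values chosen only because they reproduce the reported numbers.

```python
from statistics import mean


def window_mean(samples):
    """Average raw failed-service counts over one evaluation window."""
    return mean(samples)


# Hypothetical raw samples: the service is failed in one of six samples.
flapping = [0, 0, 0, 0, 0, 1]

# Hypothetical raw samples: one service failed throughout, a second one briefly.
overlapping = [1, 1, 1, 1, 2]

print(round(window_mean(flapping), 3))  # 0.167
print(window_mean(overlapping))         # 1.2
```

Under this assumption, a value such as 0.167 does not mean "a sixth of a service" but rather that a service was failed for roughly a sixth of the window.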

Expected result

Suggestions

  • Check on staging-1.qa.suse.de why the service "user@486.service" is failing, e.g. with journalctl -u user@486.service, and either fix the cause or prevent the alert, e.g. by disabling or masking telegraf on that host.
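The check suggested above could be scripted roughly as follows. This is only a sketch, not tooling that exists in this project: the helper names are hypothetical, and it assumes it runs directly on the affected host with systemctl and journalctl available.

```python
import subprocess


def parse_failed_units(listing):
    """Extract unit names from the output of
    `systemctl list-units --state=failed --plain --no-legend`."""
    units = []
    for line in listing.splitlines():
        fields = line.split()
        if fields:
            units.append(fields[0])  # first column is the unit name
    return units


def failed_units():
    """Ask systemd for the currently failed units (hypothetical helper)."""
    result = subprocess.run(
        ["systemctl", "list-units", "--state=failed", "--plain", "--no-legend"],
        capture_output=True, text=True, check=False,
    )
    return parse_failed_units(result.stdout)


def recent_journal(unit, lines=50):
    """Return the last journal lines for one unit (hypothetical helper)."""
    result = subprocess.run(
        ["journalctl", "-u", unit, "-n", str(lines), "--no-pager"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout
```

Calling failed_units() and passing each returned name to recent_journal() would show the recent log output of every failed unit in one go, which may help to see whether the same unit keeps failing and recovering.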