action #73333
Failed systemd services alert (workers) flaky
Status: closed
Start date: 2020-10-14
Due date: 2020-10-22
% Done: 0%
Estimated time:
Tags:
Description
Observation
In the last 12 hours we received quite a few alerts for failing systemd services on a worker host. Looking at https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&panelId=6&fullscreen&edit&tab=alert&from=1602604933324&to=1602655964690 it seems that one service is repeatedly failing and recovering. The alert reported values for systemd_failed.sum between 1.2 and 0.167, which I find confusing; the fractional values are a result of how we sample the data.
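To illustrate how a count of failed services can end up fractional, here is a minimal sketch. It assumes the alert averages the per-interval systemd_failed.sum values over its evaluation window; that is an assumption about the alert configuration, not something confirmed in this ticket. The function name alert_value and the sample windows are hypothetical and chosen only to reproduce the two reported numbers.

```python
from statistics import mean


def alert_value(samples):
    """Average per-interval failed-service counts over one alert window.

    Assumption: the alert reduces the window with an average, which is
    what turns whole-number counts into fractional values.
    """
    return round(mean(samples), 3)


# Hypothetical window: the service was failed in 1 of 6 samples.
flaky_once = [0, 0, 0, 0, 0, 1]
# Hypothetical window: one failed unit throughout, a second one briefly.
mostly_failed = [1, 1, 1, 1, 2]

print(alert_value(flaky_once))     # 0.167
print(alert_value(mostly_failed))  # 1.2
```

Under this assumption, a value such as 0.167 would mean "one failed unit for a sixth of the window", not a fraction of a service.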
Expected result
- Alert does not trigger on the flaky user@486.service on staging-1.qa.suse.de
Suggestions
- Check on staging-1.qa.suse.de why service "user@486.service" is failing, e.g. with "journalctl -u user@486.service", and fix that or prevent the alert, e.g. by disabling/masking telegraf on that host.