action #73333

Updated by okurz 4 months ago

## Observation
In the last 12h we had quite a few alerts for failing systemd services on a host. Looking at https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&panelId=6&fullscreen&edit&tab=alert&from=1602604933324&to=1602655964690 it seems that one service is repeatedly failing and recovering. The alert reported values for systemd_failed.sum between 1.2 and 0.167, which I find confusing; the fractional values are a result of how we sample the data.
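
To illustrate how fractional values like these can come out of an integer count of failed units, here is a minimal sketch. It assumes the alert averages the per-interval count over a time window; that is an assumption, not something confirmed from the dashboard configuration. The sample lists and the helper name `windowed_mean` are hypothetical and chosen only to reproduce the reported numbers.

```python
from statistics import mean


def windowed_mean(samples):
    """Average a window of per-interval failed-unit counts (hypothetical helper)."""
    return round(mean(samples), 3)


# Hypothetical window: one unit was in the failed state in 1 of 6 intervals.
mostly_recovered = [0, 0, 0, 0, 0, 1]

# Hypothetical window: one unit failed throughout, a second one in 1 of 5 intervals.
mostly_failed = [1, 1, 1, 1, 2]

print(windowed_mean(mostly_recovered))  # 0.167
print(windowed_mean(mostly_failed))     # 1.2
```

Under that assumption, a value such as 0.167 would not mean "a fraction of a service failed" but rather that a service was failed for roughly one sixth of the sampled intervals, which fits a unit that keeps failing and recovering.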

## Expected result
* The alert does not fire for the flaky user@486.service on staging-1.qa.suse.de

## Suggestions
* Check on staging-1.qa.suse.de why the service "user@486.service" is failing, e.g. with `journalctl -u user@486.service`, and either fix that or prevent the alert, e.g. by disabling/masking telegraf on that host.
