action #73333 (closed): Failed systemd services alert (workers) flaky
Description
Observation
In the last 12h we had quite some alerts for failing systemd services on a worker host. Looking at https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&panelId=6&fullscreen&edit&tab=alert&from=1602604933324&to=1602655964690 it seems like one service is repeatedly failing and recovering. The alert reported values for systemd_failed.sum between 0.167 and 1.2, which I find kind of confusing; it is a result of how we sample the data.
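As a rough illustration of why the reported values can be fractional (assuming the metric is a periodically sampled count of failed units and the alert reduces the samples with an average; this is an assumption about the panel, not confirmed here): if the failed-unit count flips between 0 and 1 within the evaluation window, an average over six samples with one failure gives 1/6 ≈ 0.167.

```
# count the currently failed units the way a periodic sampler would see them
# (assumption: the systemd_failed metric is derived from a count like this)
systemctl --failed --no-legend | wc -l
```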
Expected result
- Alert is not triggered by the flaky user@486.service on staging-1.qa.suse.de
Suggestions
- check on staging-1.qa.suse.de why the service "user@486.service" is failing, e.g. with journalctl -u user@486.service (see the sketch below), and fix that or prevent the alert, e.g. by disabling/masking telegraf on that host
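A minimal sketch of the suggested triage, assuming telegraf is the agent feeding the failed-systemd-services panel on that host:

```
# inspect why the unit keeps failing on staging-1.qa.suse.de
systemctl status user@486.service
journalctl -u user@486.service

# if the unit cannot be fixed quickly, silence the alert for this host by
# stopping and masking the monitoring agent (assumption: telegraf)
systemctl stop telegraf
systemctl mask telegraf
```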
Updated by nicksinger about 4 years ago
The failing service in question is user@486.service on staging-1. We might still need to refine this alert to provide better insights.
Updated by okurz about 4 years ago
- Tags changed from alert to alert, monitoring, systemd
- Project changed from openQA Project (public) to openQA Infrastructure (public)
- Description updated (diff)
- Category deleted (Regressions/Crashes)
- Status changed from New to Workable
- Priority changed from Normal to Urgent
We should prevent the alert from failing on that specific host to remove the urgency for now. The next step after that would be to improve the information we have available in the alert. However, I would be ok with calling this ticket done after we have resolved the immediate issue, unless someone has a really good idea of how to improve the alert. We probably all agree that it could look better but wishful thinking also does not bring us further ;)
Updated by okurz about 4 years ago
- Due date set to 2020-10-22
- Status changed from Workable to Feedback
- Assignee set to okurz
- Priority changed from Urgent to High
created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/383 to prevent the confusion caused by the name of the alert and monitoring panel.
I assume that mkittler applied a complete "webui" salt state on that machine, which even included something like /etc/cron.d/SLES.CRON running cron jobs that likely fail.
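For context, user@486.service is the per-user systemd instance for UID 486, so cron jobs starting sessions under that UID could explain the flapping. A quick way to check that assumed connection (which account UID 486 belongs to on staging-1 is not stated in this ticket):

```
# which account does UID 486 belong to, and what does the cron file schedule?
getent passwd 486
cat /etc/cron.d/SLES.CRON

# correlate cron activity with the unit's recent failures
journalctl -u user@486.service --since "12 hours ago"
```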
I removed

  roles:
  - webui

from /etc/salt/grains, restarted the salt-master service and ran rm /etc/cron.d/SLES.CRON. Setting to feedback and waiting to see whether the alert still triggers.
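While waiting in feedback, a quick manual check on staging-1 that the cleanup took effect (plain systemctl, nothing beyond what is described above):

```
# confirm that no unit is failed anymore and that the flaky unit stays quiet
systemctl --failed
systemctl status user@486.service
```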
Updated by okurz about 4 years ago
I have another change prepared to cover the suggestions from nsinger to improve the output in case of alerts, but I cannot push these changes to GitLab because my account seems to be blocked now, reported in https://infra.nue.suse.com/SelfService/Display.html?id=178670
Updated by okurz about 4 years ago
- Status changed from Feedback to Resolved
All necessary changes are in and no alert triggered again.