Failed systemd services alert (workers) flaky
In the last 12h we had quite a few alerts for failing systemd services on a worker host. Looking at https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&panelId=6&fullscreen&edit&tab=alert&from=1602604933324&to=1602655964690 it seems that one service is repeatedly failing and recovering. The alert stated values for systemd_failed.sum between 1.2 and 0.167, which I find confusing; these fractional values are a result of how we sample the data.
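To illustrate how a count of failed services can show up as a fraction, here is a minimal sketch. It assumes the alert averages the raw samples over its evaluation window; I have not verified that this is how our query reduces the data, and the sample values below are made up.

```python
def window_mean(samples):
    """Average raw samples the way a mean reducer over one alert window would."""
    return sum(samples) / len(samples)


# Hypothetical raw samples of the number of failed units (not taken from
# the real dashboard): one service flapping between failed and recovered.
mostly_ok = [0, 0, 1, 0, 0, 0]   # failed in one of six samples
mostly_failed = [1, 1, 2, 1, 1]  # a second unit failed briefly

print(round(window_mean(mostly_ok), 3))  # 0.167
print(window_mean(mostly_failed))        # 1.2
```

With such a reducer, a value below 1 does not mean "a fraction of a service failed", only that a service was failed for part of the window.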
- Alert does not fail on flaky firstname.lastname@example.org on staging-1.qa.suse.de
- Check on staging-1.qa.suse.de why service "email@example.com" is failing, e.g. with
journalctl -u firstname.lastname@example.org, and either fix it or prevent the alert, e.g. by disabling or masking telegraf on that host.
- Tags changed from alert to alert, monitoring, systemd
- Project changed from openQA Project to openQA Infrastructure
- Description updated
- Category deleted (
- Status changed from New to Workable
- Priority changed from Normal to Urgent
We should prevent the alert from failing on that specific host to remove the urgency for now. The next step after that would be to improve the information available in the alert. However, I would be OK with calling this ticket done once we have resolved the immediate issue, unless someone has a really good idea of how to improve the alert. We probably all agree that it could look better, but wishful thinking does not get us any further ;)
- Due date set to 2020-10-22
- Status changed from Workable to Feedback
- Assignee set to okurz
- Priority changed from Urgent to High
Created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/383 to prevent the confusion caused by the name of the alert and the monitoring panel.
I assume that mkittler applied a complete "webui salt state" on that machine and that this even included something like /etc/cron.d/SLES.CRON, which runs some cron jobs that likely fail.
I removed
roles: - webui
from /etc/salt/grains, restarted the salt-master service and ran
rm /etc/cron.d/SLES.CRON. Setting to Feedback and waiting to see whether the alert still triggers.
I have another change prepared to cover the suggestions from nsinger to improve the output in case of alerts. However, I cannot push these changes to GitLab because my account seems to be blocked at the moment, as reported in https://infra.nue.suse.com/SelfService/Display.html?id=178670