action #178015
opencoordination #161414: [epic] Improved salt based infrastructure management
[false negative] Many failed systemd services but no alert
Description
Observation
It often starts innocently, as in https://suse.slack.com/archives/C02CANHLANP/p1740668762857669 where José Fernández asked why a change in os-autoinst-distri-opensuse does not seem to work on aarch64. A few steps down the rabbit hole I found that we have many failed systemd services on various hosts. The dashboard https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services happily shows them alongside green hearts, and no related alerts are firing although they should be.
Suggestions
- Check the current alert definitions in Grafana
- Check the git history of https://gitlab.suse.de/openqa/salt-states-openqa or the ticket history for changes that might have introduced the regression
- Identify the problem, fix it, and let the team learn how it came to this
Updated by mkittler 3 days ago · Edited
Link to Grafana with the relevant time window: https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=2025-02-24T09%3A05%3A51.970Z&to=2025-02-28T18%3A50%3A29.508Z&timezone=UTC
So the alert query seems to be correct. The alert condition also makes sense.
The alert was also recently firing (2025-02-17 04:36:24). It should have been firing much sooner, though: https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=2025-02-04T22%3A54%3A39.117Z&to=2025-02-17T14%3A43%3A34.958Z&timezone=UTC
(explore link: https://monitor.qa.suse.de/explore?schemaVersion=1&panes=%7B%227c0%22%3A%7B%22datasource%22%3A%22000000001%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22alias%22%3A%22Sum%20of%20failed%20systemd%20services%22%2C%22groupBy%22%3A%5B%7B%22params%22%3A%5B%22%24__interval%22%5D%2C%22type%22%3A%22time%22%7D%2C%7B%22params%22%3A%5B%22null%22%5D%2C%22type%22%3A%22fill%22%7D%5D%2C%22interval%22%3A%221m%22%2C%22intervalMs%22%3A1000%2C%22maxDataPoints%22%3A43200%2C%22measurement%22%3A%22systemd_failed%22%2C%22orderByTime%22%3A%22ASC%22%2C%22policy%22%3A%22default%22%2C%22resultFormat%22%3A%22time_series%22%2C%22select%22%3A%5B%5B%7B%22params%22%3A%5B%22failed%22%5D%2C%22type%22%3A%22field%22%7D%2C%7B%22params%22%3A%5B%5D%2C%22type%22%3A%22sum%22%7D%5D%5D%2C%22tags%22%3A%5B%7B%22key%22%3A%22host%22%2C%22operator%22%3A%22!%3D%22%2C%22value%22%3A%22openqa%22%7D%5D%2C%22datasource%22%3A%7B%22type%22%3A%22influxdb%22%2C%22uid%22%3A%22000000001%22%7D%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221738714427023%22%2C%22to%22%3A%221739830938673%22%7D%7D%7D&orgId=1)
The weird thing is that we have an "ok" marker without a preceding "firing" marker, and the "ok" marker appears right in the middle of a problematic section where nothing was ok.
So I guess the averaging we had before https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/4429f893f545dc91c06db4a4db0b5d17ccadb457 made some sense. However, just reverting that change would bring back the equally undesirable behavior we had before.
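To illustrate why averaging can make a difference, here is a minimal sketch comparing a "last" and a "mean" reduction over an evaluation window. It assumes the summed series occasionally dips to 0 while units are still failed; that is an assumption for illustration, not something established in this ticket. The functions and sample values are hypothetical stand-ins and not Grafana's implementation.

```python
from statistics import mean


def reduce_last(window):
    """Return the most recent non-null sample, or None if all are null."""
    for value in reversed(window):
        if value is not None:
            return value
    return None


def reduce_mean(window):
    """Return the mean of all non-null samples, or None if all are null."""
    values = [v for v in window if v is not None]
    return mean(values) if values else None


def breaches(value, threshold=0):
    """Alert condition: the reduced value is above the threshold."""
    return value is not None and value > threshold


# Hypothetical evaluation windows of the summed `failed` field.
windows = [
    [2, 2, 2, 0],  # momentary dip at the end of the window
    [2, 2, 0, 2],  # dip in the middle of the window
    [0, 0, 0, 0],  # genuinely nothing failed
]

for window in windows:
    print(window, breaches(reduce_last(window)), breaches(reduce_mean(window)))
```

With "last", a single dip at the end of the window makes the condition look healthy, whereas "mean" still breaches. The flip side is that "mean" keeps breaching for a while after everything has recovered.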
Updated by mkittler 3 days ago · Edited
- Status changed from In Progress to Feedback
MR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1389
EDIT: The change has been deployed. I restarted the grafana service and it looks as expected on the web UI.
Updated by okurz 3 days ago
- Blocks action #177318: 2 bare-metal machines are offline on OSD added
Updated by mkittler about 5 hours ago
- Status changed from Feedback to In Progress
So far there were no failing systemd services. I guess I'll provoke a failing unit to see whether it works.
Updated by mkittler about 3 hours ago
This still doesn't work; now we get a cycle between pending and ok. (I provoked openqa-reload-worker-auto-restart@999 to fail constantly on worker33.)
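A cycle between pending and ok is what one would expect if the alert condition flaps faster than the pending period. The following is a simplified sketch of that state logic, not Grafana's actual code; `simulate` and `pending_evals` are hypothetical names, and whether the condition really flaps here is an assumption.

```python
def simulate(breached_per_eval, pending_evals):
    """Return the alert state after each rule evaluation.

    breached_per_eval: booleans, one per evaluation (True = condition met).
    pending_evals: hypothetical stand-in for the pending period, expressed
    as the number of evaluations the rule stays pending before it fires.
    """
    states = []
    streak = 0  # consecutive evaluations with the condition met
    for breached in breached_per_eval:
        if not breached:
            streak = 0  # any healthy evaluation resets the pending timer
            states.append("ok")
            continue
        streak += 1
        states.append("firing" if streak > pending_evals else "pending")
    return states


# Condition met continuously: the rule eventually fires.
print(simulate([True] * 5, pending_evals=3))

# Condition drops out every third evaluation: the rule never fires.
print(simulate([True, True, False] * 2, pending_evals=3))
```

In the second case the rule never leaves the pending/ok cycle even though the unit is failed most of the time.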
Maybe it makes sense to switch to @nicksinger's approach then: https://stats.openqa-monitor.qa.suse.de/alerting/grafana/beefj548t0a2oc/view
Updated by mkittler about 3 hours ago
- Blocks deleted (action #177318: 2 bare-metal machines are offline on OSD)
Updated by mkittler about 3 hours ago
- Related to action #177318: 2 bare-metal machines are offline on OSD added