action #178015
Status: closed
Parent: coordination #161414: [epic] Improved salt based infrastructure management
[false negative] Many failed systemd services but no alert has fired size:S
Description
Observation
It often starts innocently, like in https://suse.slack.com/archives/C02CANHLANP/p1740668762857669 when José Fernández asked why a change in os-autoinst-distri-opensuse does not seem to work on aarch64. Some steps further down the rabbit hole I found that we have many failed systemd services on various hosts, which https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services happily shows along with green hearts, yet there are no related firing alerts even though there should be.
Acceptance Criteria
- AC1: It is understood why aarch64 revealed issues with systemd services and follow-up tickets are filed
Suggestions
- Check current alert definitions in grafana
- Check our git history in https://gitlab.suse.de/openqa/salt-states-openqa or ticket history for potential regression-introducing candidates (see the sketch after this list)
- Identify the problem and fix it and let the team learn how it came to this
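A minimal sketch of the history check, assuming the alert definitions live below a monitoring/ directory of the repository (the exact path is an assumption):

    git clone https://gitlab.suse.de/openqa/salt-states-openqa.git
    cd salt-states-openqa
    # list recent commits touching the (assumed) monitoring directory
    git log --oneline --since="3 months ago" -- monitoring/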
Rollback steps
- Reset the failed state of openqa-reload-worker-auto-restart@999 on worker33 and run systemctl unmask openqa-worker-auto-restart@999.
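A minimal sketch of these steps, to be run as root on worker33:

    # clear the failure state of the reload unit that was provoked for testing
    systemctl reset-failed openqa-reload-worker-auto-restart@999
    # remove the mask again so the worker unit can run normally
    systemctl unmask openqa-worker-auto-restart@999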
Updated by mkittler 3 months ago · Edited
Link to Grafana with the relevant time window: https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=2025-02-24T09%3A05%3A51.970Z&to=2025-02-28T18%3A50%3A29.508Z&timezone=UTC
So the alert query seems to be correct. The alert condition also makes sense.
The alert was also recently firing (2025-02-17 04:36:24). It should have been firing much sooner, though: https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=2025-02-04T22%3A54%3A39.117Z&to=2025-02-17T14%3A43%3A34.958Z&timezone=UTC
(explore link: https://monitor.qa.suse.de/explore?schemaVersion=1&panes=%7B%227c0%22%3A%7B%22datasource%22%3A%22000000001%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22alias%22%3A%22Sum%20of%20failed%20systemd%20services%22%2C%22groupBy%22%3A%5B%7B%22params%22%3A%5B%22%24__interval%22%5D%2C%22type%22%3A%22time%22%7D%2C%7B%22params%22%3A%5B%22null%22%5D%2C%22type%22%3A%22fill%22%7D%5D%2C%22interval%22%3A%221m%22%2C%22intervalMs%22%3A1000%2C%22maxDataPoints%22%3A43200%2C%22measurement%22%3A%22systemd_failed%22%2C%22orderByTime%22%3A%22ASC%22%2C%22policy%22%3A%22default%22%2C%22resultFormat%22%3A%22time_series%22%2C%22select%22%3A%5B%5B%7B%22params%22%3A%5B%22failed%22%5D%2C%22type%22%3A%22field%22%7D%2C%7B%22params%22%3A%5B%5D%2C%22type%22%3A%22sum%22%7D%5D%5D%2C%22tags%22%3A%5B%7B%22key%22%3A%22host%22%2C%22operator%22%3A%22!%3D%22%2C%22value%22%3A%22openqa%22%7D%5D%2C%22datasource%22%3A%7B%22type%22%3A%22influxdb%22%2C%22uid%22%3A%22000000001%22%7D%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221738714427023%22%2C%22to%22%3A%221739830938673%22%7D%7D%7D&orgId=1)
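For reference, the panel query encoded in the explore link corresponds roughly to the following InfluxQL, wrapped here in an influx CLI call (a sketch; the database name telegraf, CLI access to the InfluxDB host and the explicit time filter standing in for the dashboard's time range are assumptions):

    # sum of failed systemd services across all hosts except the web UI host
    influx -database telegraf -execute "
      SELECT sum(\"failed\")
      FROM \"systemd_failed\"
      WHERE \"host\" != 'openqa' AND time > now() - 1h
      GROUP BY time(1m) fill(null)"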
The weird thing is that we have an "ok" marker without a preceding "firing" marker, and the "ok" marker appears right in the middle of a problematic section where nothing was ok.
So I guess the averaging we had before https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/4429f893f545dc91c06db4a4db0b5d17ccadb457 made some sense. However, just reverting the MR would bring back the behavior we had before, which is also not desirable.
Updated by mkittler 3 months ago · Edited
- Status changed from In Progress to Feedback
MR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1389
EDIT: The change has been deployed. I restarted the grafana service and it looks as expected on the web UI.
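A sketch of that restart, assuming Grafana runs as the standard grafana-server unit on the monitoring host:

    sudo systemctl restart grafana-server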
Updated by okurz 3 months ago
- Blocks action #177318: 2 bare-metal machines are offline on OSD added
Updated by mkittler 3 months ago
This still doesn't work; now we get a cycle between pending and ok. (And I provoked openqa-reload-worker-auto-restart@999 to be constantly failing on worker33.)
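For the record, a sketch of how such a permanent failure can be provoked; the ticket does not spell out the exact method, masking the worker unit is only inferred from the rollback steps above:

    # on worker33: masking the worker unit makes the corresponding reload unit fail persistently
    systemctl mask openqa-worker-auto-restart@999
    systemctl start openqa-reload-worker-auto-restart@999   # fails and stays in the failed state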
Maybe it makes sense to switch to @nicksinger's approach then: https://stats.openqa-monitor.qa.suse.de/alerting/grafana/beefj548t0a2oc/view
Updated by mkittler 3 months ago
- Blocks deleted (action #177318: 2 bare-metal machines are offline on OSD)
Updated by mkittler 3 months ago
- Related to action #177318: 2 bare-metal machines are offline on OSD added
Updated by mkittler 3 months ago
- Description updated (diff)
- Assignee changed from mkittler to nicksinger
I discussed this with @nicksinger who adjusted his approach at the same time. It works now by looking at a time interval of 5 minutes.
My previous attempt to increase the interval/time-grouping of the current alert to 150 seconds turned out to be insufficient.
We decided to go for @nicksinger's change as we now also saw that it actually works.
Updated by openqa_review 3 months ago
- Due date set to 2025-03-18
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger 3 months ago
I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1393 to replace the single alert with an instantiated one, which can now be seen here: https://monitor.qa.suse.de/alerting/grafana/beefj548t0a2oc/view. Each machine has its own alert instance, as can be seen on https://monitor.qa.suse.de/alerting/grafana/beefj548t0a2oc/view?tab=instances
The graph shows the old alert at the beginning; since ~2025-03-05 12:00 it shows the new one (https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&viewPanel=panel-6&from=2025-03-04T09:34:03.000Z&to=now&timezone=UTC) without being flaky or triggering on and off. However, I realized that including the units as a tag might not have been a good idea: without any failing units the contents of the tag change, and with them the alert instance definition. I will remove them again and we can look into including the failed units in alert mails later.
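A rough sketch of the per-host variant of that query, which is what yields one alert instance per machine (again a sketch; the database name and exact grouping are assumptions, the authoritative definition is in the MR above):

    # one series (and thus one alert instance) per host, evaluated over the last 5 minutes
    influx -database telegraf -execute "
      SELECT sum(\"failed\")
      FROM \"systemd_failed\"
      WHERE \"host\" != 'openqa' AND time > now() - 5m
      GROUP BY \"host\""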
Updated by nicksinger 3 months ago
- Status changed from In Progress to Resolved
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1396 merged and deployed (I had to use alerts_to_delete.yaml to redeploy my changes). This should be sufficient for now.