action #178015
opencoordination #161414: [epic] Improved salt based infrastructure management
[false negative] Many failed systemd services but no alert
Status: In Progress
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2025-02-27
Due date:
% Done: 0%
Estimated time:
Tags:
Description
Observation
It often starts innocently, as in https://suse.slack.com/archives/C02CANHLANP/p1740668762857669, where José Fernández asked why a change in os-autoinst-distri-opensuse did not seem to work on aarch64. Digging a few steps down the rabbit hole, I found many failed systemd services on various hosts. https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services happily shows them next to green hearts, yet no related alerts are firing although there should be.
Suggestions
- Check the current alert definitions in Grafana
- Check our git history in https://gitlab.suse.de/openqa/salt-states-openqa or the ticket history for changes that may have introduced the regression
- Identify the problem, fix it, and let the team learn how it came to this
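
As a starting point for comparing the dashboard with what a host itself reports, the number of failed units can be counted locally. This is only a minimal sketch: the helper name count_failed_units is hypothetical and not part of salt-states-openqa, and it assumes its input comes from systemctl with --plain and --no-legend so that every non-empty line is exactly one unit.

```shell
# Hypothetical helper, not part of salt-states-openqa.
# Reads `systemctl list-units` output on stdin and prints the number of units.
# Assumes --plain --no-legend output, where each non-empty line is one unit.
count_failed_units() {
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    grep -c . || true
}
```

On a worker this could be used as `systemctl list-units --state=failed --no-legend --plain | count_failed_units`. A non-zero result on a host for which no alert is firing would confirm the false negative described above.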
Rollback steps
- Reset the failed state of `openqa-reload-worker-auto-restart@999` on worker33 and run `systemctl unmask openqa-worker-auto-restart@999`.
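
The rollback could look roughly like the following when run as root on worker33. This is a sketch rather than a verified procedure: the function name rollback_worker_units is hypothetical, the unit names are copied from the step above, and it assumes `systemctl reset-failed` is the intended way to reset the failed state. When no unit suffix is given, systemctl assumes ".service".

```shell
# Hypothetical wrapper for the rollback; unit names are taken from the ticket.
rollback_worker_units() {
    # Clear the "failed" state so the unit is no longer listed as failed.
    systemctl reset-failed 'openqa-reload-worker-auto-restart@999' || return 1
    # Remove the mask so the unit can be started again.
    systemctl unmask 'openqa-worker-auto-restart@999'
}
```

Calling `rollback_worker_units` runs both steps and stops after the first one if it fails.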