action #125744
Status: closed
[tools][alert][FIRING:1] (Failed systemd services alert (except openqa.suse.de) QDG8aXAVz) due to openqa-piworker.qa.suse.de unable to reach openqa.suse.de
Start date: 2023-03-10
% Done: 0%
Description
Observation
*Firing: 1 alert*
Firing
_*Failed systemd services alert (except openqa.suse.de)*_
*Value:* [ var='B0' metric='Sum of failed systemd services' labels={} value=1 ]
*message:* Check failed systemd services on hosts with `systemctl --failed`. Hint: Go to parent dashboard https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services to see a list of affected hosts.
*Labels:*
* alertname: Failed systemd services alert (except openqa.suse.de)
* rule_uid: QDG8aXAVz
*Silence* [3] | *Go to Dashboard* [5] | *Go to Panel* [6] | *Source* [7]
*Go to alerts page* [8]
[3] http://stats.openqa-monitor.qa.suse.de/alerting/silence/new?alertmanager=grafana&matcher=alertname%3DFailed+systemd+services+alert+%28except+openqa.suse.de%29&matcher=rule_uid%3DQDG8aXAVz
[6] http://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz?viewPanel=6
[7] http://stats.openqa-monitor.qa.suse.de/alerting/grafana/QDG8aXAVz/view
[8] http://stats.openqa-monitor.qa.suse.de/alerting/list?alertState=firing&view=state
This is likely because openqa-piworker.qa.suse.de is unable to reach openqa.suse.de, a problem dheidler also reported yesterday in https://suse.slack.com/archives/C02CANHLANP/p1678375014522009
Rollback steps
- Unsilence alert "Packet loss between worker hosts and other hosts" https://stats.openqa-monitor.qa.suse.de/d/EML0bpuGk/monitoring?viewPanel=4&orgId=1
Suggestions
- Investigate DNS resolution on openqa-piworker.qa.suse.de, optionally together with dheidler
- Fix the underlying problem
- Where applicable, apply the same solution to other machines in the FC Basement
- Cross-check monitoring data and unpause related alerts
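The DNS investigation could start with a small check like the one below. This is only a sketch: it assumes Python 3 is available on openqa-piworker.qa.suse.de and that a TCP connection to port 443 on openqa.suse.de is a reasonable reachability probe. The helper name `check_host` and the port are illustrative and do not come from this ticket. The aim is to tell a name resolution failure apart from a host that resolves but cannot be reached.

```python
import socket


def check_host(hostname, port=443, timeout=5.0,
               resolver=socket.getaddrinfo,
               connector=socket.create_connection):
    """Classify connectivity to hostname as 'dns_failure', 'unreachable' or 'ok'.

    resolver and connector are injectable so the logic can be exercised
    without network access; by default the standard socket calls are used.
    Port 443 is an assumption, not something confirmed in the ticket.
    """
    # Step 1: name resolution. socket.gaierror signals a DNS-level failure.
    try:
        infos = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as err:
        return "dns_failure", str(err)

    # Each getaddrinfo entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    addresses = sorted({info[4][0] for info in infos})

    # Step 2: TCP connect. Timeouts and refusals are subclasses of OSError.
    try:
        conn = connector((hostname, port), timeout=timeout)
    except OSError as err:
        return "unreachable", f"resolved to {addresses}, connect failed: {err}"
    conn.close()
    return "ok", f"resolved to {addresses}"


if __name__ == "__main__":
    # Intended to be run on the affected worker.
    status, detail = check_host("openqa.suse.de")
    print(f"openqa.suse.de: {status} ({detail})")
```

A `dns_failure` result points at the resolver configuration on the worker, while `unreachable` with resolved addresses points at routing or firewalling between the worker and openqa.suse.de.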