action #163778
Updated by okurz 15 days ago
## Observation
We had several alerts regarding s390zl12 today, firing and resolving shortly after each other:
http://stats.openqa-monitor.qa.suse.de/alerting/grafana/5ddd66bd99b31f7597fd68af2cc96304f8d9e480/view?orgId=1
http://stats.openqa-monitor.qa.suse.de/alerting/grafana/openqa_ping_time_alert_s390zl12/view?orgId=1
```
Date: Thu, 11 Jul 2024 14:24:34 +0200
From: Grafana <osd-admins@suse.de>
To: osd-admins@suse.de
Subject: [FIRING:2] s390zl12
2 firing alert instances
[IMAGE]
📁 GROUPED BY
hostname=s390zl12
🔥 2 firing instances
Firing [stats.openqa-monitor.qa.suse.de]
s390zl12: Ping time alert
View alert [stats.openqa-monitor.qa.suse.de]
Values
B0=372.96250000000003
Labels
alertname
s390zl12: Ping time alert
grafana_folder
Generic
hostname
s390zl12
rule_uid
ping_time_alert_s390zl12
type
generic
Silence [stats.openqa-monitor.qa.suse.de]
View dashboard [stats.openqa-monitor.qa.suse.de]
View panel [stats.openqa-monitor.qa.suse.de]
Observed 34s before this notification was delivered, at 2024-07-11 14:24:00 +0200 CEST
Firing [stats.openqa-monitor.qa.suse.de]
s390zl12: OpenQA Ping time alert
View alert [stats.openqa-monitor.qa.suse.de]
Values
B0=372.96250000000003
Labels
alertname
s390zl12: OpenQA Ping time alert
grafana_folder
hostname
s390zl12
rule_uid
openqa_ping_time_alert_s390zl12
type
worker
```
## Suggestions
* Check the machines. If you can't reach them then use the fancy s390x admin interface on https://zhmc2.suse.de, see https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/a67bce38ddcc46a9c756d3866b794fc6bdc1d900/openqa/workerconf.sls#L2464
* Ensure a clean salt state applied to both as they are both in salt
* Ensure that openQA jobs using those hosts work fine again
## Rollback steps
* Remove the two silences matching `rule_uid=~(host_up|ping_time)_alert_s390zl1[23]` from https://monitor.qa.suse.de/alerting/silences
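
To double-check which alert rules the silence matcher covers before removing it, the regex can be tried against the `rule_uid` values from the alert mail above. This is only a rough sketch: it assumes the matcher is anchored at both ends (as Prometheus Alertmanager regex matchers are), and `is_silenced` as well as the `host_up_alert_s390zl13` UID are hypothetical names made up for illustration.

```python
import re

# Regex value copied from the silence matcher in the rollback step.
SILENCE_PATTERN = r"(host_up|ping_time)_alert_s390zl1[23]"


def is_silenced(rule_uid: str, pattern: str = SILENCE_PATTERN) -> bool:
    """Return True if the whole rule_uid matches the silence pattern.

    Assumption: the matcher is fully anchored, so a prefix such as
    "openqa_" prevents a match.
    """
    return re.fullmatch(pattern, rule_uid) is not None


rule_uids = [
    "ping_time_alert_s390zl12",         # first alert instance in the mail
    "openqa_ping_time_alert_s390zl12",  # second alert instance in the mail
    "host_up_alert_s390zl13",           # hypothetical UID, illustration only
]

for uid in rule_uids:
    print(f"{uid}: {'silenced' if is_silenced(uid) else 'not silenced'}")
```

Under that anchoring assumption, `openqa_ping_time_alert_s390zl12` would not be covered by this matcher, so it may be worth checking the silences page to see whether the openQA-specific alert has its own silence.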