action #163778
[alert] host_up & Average Ping time (ms) alert for s390zl12&s390zl13 size:S
Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-07-11
Due date: 2024-08-06
% Done: 0%
Estimated time:
Tags:
Description
Observation
We had several alerts regarding s390zl12 today, firing and resolving shortly after each other:
http://stats.openqa-monitor.qa.suse.de/alerting/grafana/5ddd66bd99b31f7597fd68af2cc96304f8d9e480/view?orgId=1
http://stats.openqa-monitor.qa.suse.de/alerting/grafana/openqa_ping_time_alert_s390zl12/view?orgId=1
Date: Thu, 11 Jul 2024 14:24:34 +0200
From: Grafana osd-admins@suse.de
To: osd-admins@suse.de
Subject: [FIRING:2] s390zl12
2 firing alert instances
Suggestions
- Check the machines. If you can't reach them, use the fancy s390x admin interface on https://zhmc2.suse.de, see https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/a67bce38ddcc46a9c756d3866b794fc6bdc1d900/openqa/workerconf.sls#L2464
  - DONE, both machines can be reached and systemctl shows "running"
- Ensure a clean salt state is applied to both, as they are both in salt
  - DONE, no issues applying a highstate on both hosts
- Ensure that openQA jobs using those hosts work fine again
  - Most of them do on zl12. Only individual worker instances seem to be affected by "Error connecting to VNC server s390kvm091.oqa.prg2.suse.org:5901: IO::Socket::INET: connect: Connection refused"
    - e.g. https://openqa.suse.de/admin/workers/3867 (worker32:6)
    - e.g. https://openqa.suse.de/admin/workers/2613 (worker32:7)
    - e.g. https://openqa.suse.de/admin/workers/2618 (worker32:12)
  - "Connection refused" points to the firewall. Could be related to our own implementation of it: https://progress.opensuse.org/issues/159066#note-16
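The "Connection refused" symptom can be told apart from a silently dropped connection with a plain TCP probe. The following is a minimal sketch and not something taken from the ticket: the helper name probe_tcp is hypothetical, and the host and port in the usage comment are copied from the error message above. A refusal means an active rejection reached the client (nothing listening on the port, or a firewall rule that rejects), whereas a timeout would rather suggest that packets are being dropped.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a TCP connect and classify the outcome.

    Returns 'open', 'refused', 'timeout' or 'error: <reason>'.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer (or a rejecting firewall) answered with a reset.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except OSError as exc:
        # DNS failures, unreachable network and similar.
        return f"error: {exc}"


# Example call, to be run from the affected worker host:
# print(probe_tcp("s390kvm091.oqa.prg2.suse.org", 5901))
```

Running the probe from worker32 and from a worker that is not affected might help narrow down whether the refusal depends on the source host.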
Rollback steps
- Remove the silence rule_uid=~(host_up|ping_time)_alert_s390zl1[23] from https://monitor.qa.suse.de/alerting/silences
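To see which rule UIDs the silence matcher would cover, the pattern can be tried against a few candidate values. This is only a sketch under stated assumptions: the candidate list is hypothetical apart from openqa_ping_time_alert_s390zl12, which appears in the alert URL in the observation, and Python's re module stands in for the matcher engine. Alertmanager-style "=~" matchers are, as far as is known, anchored to the whole label value; whether this Grafana instance behaves the same way is an assumption worth verifying, so both interpretations are shown.

```python
import re

# Matcher value copied from the rollback step above.
PATTERN = r"(host_up|ping_time)_alert_s390zl1[23]"

# Hypothetical candidate UIDs for illustration.
CANDIDATES = [
    "host_up_alert_s390zl12",
    "host_up_alert_s390zl13",
    "ping_time_alert_s390zl13",
    "openqa_ping_time_alert_s390zl12",
    "host_up_alert_s390zl14",
]


def matches(uid: str, anchored: bool = True) -> bool:
    """Check a UID against PATTERN, either fully anchored or as a substring search."""
    if anchored:
        return re.fullmatch(PATTERN, uid) is not None
    return re.search(PATTERN, uid) is not None


for uid in CANDIDATES:
    print(f"{uid}: anchored={matches(uid)} unanchored={matches(uid, anchored=False)}")
```

Under the anchored interpretation a UID with an extra prefix is not covered by the pattern, so it may be worth confirming in the silences view which alert instances the silence actually affects before removing it.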