action #163778
Updated by nicksinger 5 months ago
## Observation
We had several alerts regarding s390zl12 today, firing and resolving shortly after each other:
http://stats.openqa-monitor.qa.suse.de/alerting/grafana/5ddd66bd99b31f7597fd68af2cc96304f8d9e480/view?orgId=1
http://stats.openqa-monitor.qa.suse.de/alerting/grafana/openqa_ping_time_alert_s390zl12/view?orgId=1
Date: Thu, 11 Jul 2024 14:24:34 +0200
From: Grafana <osd-admins@suse.de>
To: osd-admins@suse.de
Subject: [FIRING:2] s390zl12
2 firing alert instances
## Suggestions
* ~~Check the machines. If you can't reach them then use
the fancy s390x admin interface on https://zhmc2.suse.de, see
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/a67bce38ddcc46a9c756d3866b794fc6bdc1d900/openqa/workerconf.sls#L2464~~ *DONE, both machines can be reached and systemctl shows "running"*
* ~~Ensure a clean salt state applied to both as they are both in salt~~ *DONE, no issues applying a highstate on both hosts*
* Ensure that openQA jobs using those hosts work fine again
  * Most of them do on zl12. Only individual worker instances seem to be affected by "Error connecting to VNC server <s390kvm091.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: Connection refused"
    * e.g. https://openqa.suse.de/admin/workers/3867 (worker32:6)
    * e.g. https://openqa.suse.de/admin/workers/2613 (worker32:7)
    * e.g. https://openqa.suse.de/admin/workers/2618 (worker32:12)
  * "Connection refused" points to a firewall problem. It could be related to our own firewall implementation: https://progress.opensuse.org/issues/159066#note-16
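
To narrow down the VNC error above, a quick TCP probe run from the affected worker host might help. The following is only a minimal sketch: the helper name `probe_tcp` is made up for illustration, and the host and port are taken from the error message. A "refused" result means the connection attempt was actively rejected (either nothing is listening on the port or a firewall rule rejects it), whereas a firewall that silently drops packets would more likely show up as "timeout".

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a plain TCP connect and classify the outcome (hypothetical helper)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Peer answered with a reset: closed port or a rejecting firewall rule.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout, e.g. packets silently dropped.
        return "timeout"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar problems.
        return f"error: {exc}"


if __name__ == "__main__":
    # Host and port as quoted in the openQA error message.
    print(probe_tcp("s390kvm091.oqa.prg2.suse.org", 5901))
```

Running the same probe from a worker instance that is not affected could show whether the problem is specific to certain ports or guests.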
## Rollback steps
* Remove the silence `rule_uid=~(host_up|ping_time)_alert_s390zl1[23]` from https://monitor.qa.suse.de/alerting/silences
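
Before removing the silence, it can be useful to double-check which rule UIDs the matcher covers. The sketch below assumes that the silence matcher behaves like a fully anchored regular expression, as Alertmanager-style matchers do. The UIDs in the example are illustrative guesses derived from the pattern and from the alert URL in the observation, not verified rule UIDs.

```python
import re

# Pattern copied from the silence matcher `rule_uid=~...`.
SILENCE_PATTERN = r"(host_up|ping_time)_alert_s390zl1[23]"


def silence_matches(rule_uid: str) -> bool:
    """Return True if the whole rule UID matches the silence pattern."""
    # fullmatch emulates an anchored matcher (assumption, see lead-in).
    return re.fullmatch(SILENCE_PATTERN, rule_uid) is not None


if __name__ == "__main__":
    # Illustrative UIDs only; the real ones may differ.
    for uid in (
        "host_up_alert_s390zl12",
        "ping_time_alert_s390zl13",
        "ping_time_alert_s390zl14",
        "openqa_ping_time_alert_s390zl12",
    ):
        print(uid, silence_matches(uid))
```

Under this assumption, a UID with an additional prefix would not be covered by the silence, so it may be worth comparing the pattern against the actual rule UIDs shown in Grafana.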