action #163778
Updated by nicksinger 5 months ago
## Observation

We had several alerts regarding s390zl12 today, firing and resolving shortly after each other:

* http://stats.openqa-monitor.qa.suse.de/alerting/grafana/5ddd66bd99b31f7597fd68af2cc96304f8d9e480/view?orgId=1
* http://stats.openqa-monitor.qa.suse.de/alerting/grafana/openqa_ping_time_alert_s390zl12/view?orgId=1

Date: Thu, 11 Jul 2024 14:24:34 +0200

From: Grafana <osd-admins@suse.de>

To: osd-admins@suse.de

Subject: [FIRING:2] s390zl12

2 firing alert instances

## Suggestions

* ~~Check the machines. If you can't reach them then use the fancy s390x admin interface on https://zhmc2.suse.de, see https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/a67bce38ddcc46a9c756d3866b794fc6bdc1d900/openqa/workerconf.sls#L2464~~ *DONE, both machines can be reached and systemctl shows "running"*
* ~~Ensure a clean salt state is applied to both as they are both in salt~~ *DONE, no issues applying a highstate on both hosts*
* Ensure that openQA jobs using those hosts work fine again
  * Most of them do on zl12. Only single worker instances seem to be affected by `Error connecting to VNC server <s390kvm091.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: Connection refused`
    * e.g. https://openqa.suse.de/admin/workers/2613 (worker32:7)
    * e.g. https://openqa.suse.de/admin/workers/2618 (worker32:12)
  * "Connection refused" points to a firewall. Could be related to our own implementation of it: https://progress.opensuse.org/issues/159066#note-16

## Rollback steps

* Remove the silence `rule_uid=~(host_up|ping_time)_alert_s390zl1[23]` from https://monitor.qa.suse.de/alerting/silences
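
To narrow down the "Connection refused" error on the VNC port mentioned under Suggestions, a small probe run from the affected worker host could help. This is only a minimal sketch: `probe_tcp` is a hypothetical helper name, not part of openQA, and the interpretation of the results is an assumption about the setup. "refused" means the connection was actively rejected (nothing is listening on the port, or a firewall rule rejects it), while "timeout" would rather suggest packets being silently dropped.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a TCP connect and classify the outcome.

    Returns "open", "refused", "timeout" or "error: <details>".
    Hypothetical helper for illustration, not part of openQA.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Peer answered with a reset: no listener, or a rejecting firewall rule
        return "refused"
    except socket.timeout:
        # No answer within the timeout: packets are possibly dropped on the way
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. DNS resolution failure or unreachable network
        return f"error: {exc}"
```

For example, `probe_tcp("s390kvm091.oqa.prg2.suse.org", 5901)` could be called on worker32 and compared with the result for a VNC port of a worker instance that is not affected.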