action #163778
closed
[alert] host_up & Average Ping time (ms) alert for s390zl12&s390zl13 size:S
Description
Observation
We had several alerts regarding s390zl12 today, firing and resolving shortly after each other:
http://stats.openqa-monitor.qa.suse.de/alerting/grafana/5ddd66bd99b31f7597fd68af2cc96304f8d9e480/view?orgId=1
http://stats.openqa-monitor.qa.suse.de/alerting/grafana/openqa_ping_time_alert_s390zl12/view?orgId=1
Date: Thu, 11 Jul 2024 14:24:34 +0200
From: Grafana osd-admins@suse.de
To: osd-admins@suse.de
Subject: [FIRING:2] s390zl12
2 firing alert instances
Suggestions
- Check the machines. If you can't reach them then use the fancy s390x admin interface on https://zhmc2.suse.de, see https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/a67bce38ddcc46a9c756d3866b794fc6bdc1d900/openqa/workerconf.sls#L2464
  - DONE, both machines can be reached and systemctl shows "running"
- Ensure a clean salt state is applied to both, as they are both in salt
  - DONE, no issues applying a highstate on both hosts
- Ensure that openQA jobs using those hosts work fine again
  - Most of them do on zl12. Only single worker instances seem to be affected by "Error connecting to VNC server s390kvm091.oqa.prg2.suse.org:5901: IO::Socket::INET: connect: Connection refused"
    - e.g. https://openqa.suse.de/admin/workers/3867 (worker32:6)
    - e.g. https://openqa.suse.de/admin/workers/2613 (worker32:7)
    - e.g. https://openqa.suse.de/admin/workers/2618 (worker32:12)
  - "Connection refused" points to firewall. Could be related to our own implementation of it: https://progress.opensuse.org/issues/159066#note-16
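To narrow down the "Connection refused" error, a small probe can tell an actively rejected connection apart from one that silently times out. A refusal is seen both when nothing listens on the port and when a firewall rejects the connection, while a firewall that drops packets typically shows up as a timeout. The following is only a sketch: the function name `probe_tcp` and the 3 second default timeout are assumptions, not part of openQA.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout' or 'error'."""
    try:
        # create_connection resolves the name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Peer (or a rejecting firewall) actively declined the connection
        return "refused"
    except socket.timeout:
        # No answer at all, e.g. packets dropped on the way
        return "timeout"
    except OSError:
        # Anything else, e.g. DNS resolution failure or unreachable network
        return "error"


# Hypothetical usage from the affected worker host:
#   print(probe_tcp("s390kvm091.oqa.prg2.suse.org", 5901))
```

Running it from the worker host and from another machine in the same network would show whether the refusal depends on where the connection comes from.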
Rollback steps
- Remove the silence
rule_uid=~(host_up|ping_time)_alert_s390zl1[23]
from https://monitor.qa.suse.de/alerting/silences
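To double check which rules the silence matcher covers, the regular expression can be tried against candidate rule UIDs. This is a sketch under assumptions: the UIDs below are hypothetical examples, and `re.fullmatch` is used on the assumption that the silence matcher is anchored at both ends, as regex matchers are in Prometheus Alertmanager.

```python
import re

# Pattern copied from the silence matcher in the rollback steps
SILENCE_PATTERN = r"(host_up|ping_time)_alert_s390zl1[23]"


def silenced(rule_uid: str) -> bool:
    """Return True if the rule UID matches the whole silence pattern."""
    # fullmatch mimics an anchored matcher (assumption, see lead-in)
    return re.fullmatch(SILENCE_PATTERN, rule_uid) is not None


# Hypothetical rule UIDs, for illustration only
for uid in (
    "host_up_alert_s390zl12",
    "ping_time_alert_s390zl13",
    "host_up_alert_s390zl14",
):
    print(uid, silenced(uid))
```

With these inputs only the zl12 and zl13 entries match, so removing the silence affects exactly those two hosts.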
Updated by okurz 4 months ago
- Tags set to infra, alert, s390, reactive work
- Priority changed from Normal to Urgent
@tinita thank you. A silence of 2 days is rather on the low side, which means we need to treat this ticket as urgent. I would be ok if we set something like a 2 month silence and mention the removal of the silence as a rollback step.
Updated by nicksinger 4 months ago
- Subject changed from [alert] host_up & Average Ping time (ms) alert for s390zl12&s390zl13 size:S to [alert] host_up & Average Ping time (ms) alert for s390zl12&s390zl13 size:S - auto_review:"Error connecting to VNC server <.*:5901>: IO::Socket::INET: connect: Connection refused":retry
Updated by nicksinger 4 months ago
- Subject changed from [alert] host_up & Average Ping time (ms) alert for s390zl12&s390zl13 size:S - auto_review:"Error connecting to VNC server <.*:5901>: IO::Socket::INET: connect: Connection refused":retry to [alert] host_up & Average Ping time (ms) alert for s390zl12&s390zl13 size:S
Sorry, the auto review makes no sense here and is already covered in https://progress.opensuse.org/issues/76813, which this ticket might then duplicate, but I am not sure. I asked @livdywan and @okurz in Slack how we want to handle this ticket.
Updated by nicksinger 4 months ago
- Status changed from Workable to In Progress
- Assignee set to nicksinger
Updated by nicksinger 4 months ago
Going forward, I will focus on getting rid of the false alerts and not investigate the failing tests further, because currently it looks like these two issues are not linked and just happen on the same host at the same time.
Updated by openqa_review 4 months ago
- Due date set to 2024-08-06
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger 4 months ago
As discussed in the daily, the next step I will take is to compare the response times of these machines with other ones (same network, different network, architecture, etc.) using Grafana. Based on these results I can better judge which next steps would be helpful (e.g. an SD ticket about network performance, debugging s390 specifically, bumping alert thresholds, etc.).
Updated by nicksinger 4 months ago
- Status changed from In Progress to Resolved
nicksinger wrote in #note-14:
As discussed in the daily, the next step I will take is to compare the response times of these machines with other ones (same network, different network, architecture, etc.) using Grafana. Based on these results I can better judge which next steps would be helpful (e.g. an SD ticket about network performance, debugging s390 specifically, bumping alert thresholds, etc.).
While discussing further, I took another look at the alert history. Unfortunately the last entry is from 2024-07-11, which is apparently the same date the silence was created, so I am not sure whether the alert would still fire, but currently everything looks fine. If further debugging is needed we can reopen the ticket.
Updated by livdywan 2 months ago
- Related to action #166136: s390 LPAR s390ZL12 down and unable to boot - potential corrupted filesystem added