action #163778

Updated by nicksinger 5 months ago

## Observation 

 We had several alerts regarding s390zl12 today, firing and resolving shortly after each other: 
 http://stats.openqa-monitor.qa.suse.de/alerting/grafana/5ddd66bd99b31f7597fd68af2cc96304f8d9e480/view?orgId=1 
 http://stats.openqa-monitor.qa.suse.de/alerting/grafana/openqa_ping_time_alert_s390zl12/view?orgId=1 

 Date: Thu, 11 Jul 2024 14:24:34 +0200 
 From: Grafana <osd-admins@suse.de> 
 To: osd-admins@suse.de 
 Subject: [FIRING:2] s390zl12 

 2 firing alert instances 

 ## Suggestions 
 * ~~Check the machines. If you can't reach them, use the fancy s390x admin interface on https://zhmc2.suse.de, see https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/a67bce38ddcc46a9c756d3866b794fc6bdc1d900/openqa/workerconf.sls#L2464~~ *DONE, both machines can be reached and systemctl shows "running"*
 * ~~Ensure a clean salt state is applied to both, as they are both in salt~~ *DONE, no issues applying a highstate on both hosts*
 * Ensure that openQA jobs using those hosts work fine again 
   * Most of them do on zl12. Only individual worker instances seem to be affected by "Error connecting to VNC server <s390kvm091.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: Connection refused" 
   * e.g. https://openqa.suse.de/admin/workers/2613 (worker32:7) 
   * e.g. https://openqa.suse.de/admin/workers/2618 (worker32:12) 
   * "Connection refused" points to a firewall rejecting the connection (or to nothing listening on that port). Could be related to our own implementation of it: https://progress.opensuse.org/issues/159066#note-16 
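
 To narrow down the VNC error above, it may help to check from the affected worker whether the port actively refuses connections or just times out. A refusal means something answered with a TCP reset (no listener on the port, or a firewall rule that rejects), while a timeout usually means packets are silently dropped. The following is only a minimal sketch, not part of our tooling; the helper name `probe_tcp` and the 3 second timeout are assumptions made for illustration.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify one TCP connect attempt (hypothetical helper, not openQA code).

    Returns "open", "refused", "timeout" or "error: <details>".
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer (or a rejecting firewall) answered with a TCP reset.
        return "refused"
    except socket.timeout:
        # No answer within the timeout, e.g. packets dropped on the way.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. DNS resolution failure or no route to host.
        return f"error: {exc}"


# Example call for the host and port from the error message above:
# print(probe_tcp("s390kvm091.oqa.prg2.suse.org", 5901))
```

 Running this from worker32 against the VNC ports of the affected instances, and against one that works, should show whether only specific ports are refused.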

 ## Rollback steps 

 * Remove the silence `rule_uid=~(host_up|ping_time)_alert_s390zl1[23]` from https://monitor.qa.suse.de/alerting/silences
