action #163778

Updated by okurz 15 days ago

## Observation 

 We had several alerts regarding s390zl12 today, firing and resolving shortly after each other: 
 http://stats.openqa-monitor.qa.suse.de/alerting/grafana/5ddd66bd99b31f7597fd68af2cc96304f8d9e480/view?orgId=1 
 http://stats.openqa-monitor.qa.suse.de/alerting/grafana/openqa_ping_time_alert_s390zl12/view?orgId=1 

 ```
 Date: Thu, 11 Jul 2024 14:24:34 +0200
 From: Grafana <osd-admins@suse.de>
 To: osd-admins@suse.de
 Subject: [FIRING:2] s390zl12

 2 firing alert instances
 [IMAGE]

 📁 GROUPED BY

 hostname=s390zl12

   🔥 2 firing instances

 Firing [stats.openqa-monitor.qa.suse.de]
 s390zl12: Ping time alert
 View alert [stats.openqa-monitor.qa.suse.de]
 Values
 B0=372.96250000000003
 Labels
 alertname
 s390zl12: Ping time alert
 grafana_folder
 Generic
 hostname
 s390zl12
 rule_uid
 ping_time_alert_s390zl12
 type
 generic
 Silence [stats.openqa-monitor.qa.suse.de]
 View dashboard [stats.openqa-monitor.qa.suse.de]
 View panel [stats.openqa-monitor.qa.suse.de]
 Observed 34s before this notification was delivered, at 2024-07-11 14:24:00 +0200 CEST
 Firing [stats.openqa-monitor.qa.suse.de]
 s390zl12: OpenQA Ping time alert
 View alert [stats.openqa-monitor.qa.suse.de]
 Values
 B0=372.96250000000003
 Labels
 alertname
 s390zl12: OpenQA Ping time alert
 grafana_folder
 hostname
 s390zl12
 rule_uid
 openqa_ping_time_alert_s390zl12
 type
 worker
 ```

 ## Suggestions

 * Check the machines. If you can't reach them then use the fancy s390x admin interface on https://zhmc2.suse.de, see https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/a67bce38ddcc46a9c756d3866b794fc6bdc1d900/openqa/workerconf.sls#L2464
 * Ensure a clean salt state is applied to both, as they are both in salt
 * Ensure that openQA jobs using those hosts work fine again

 ## Rollback steps 

 * Remove the silence `rule_uid=~(host_up|ping_time)_alert_s390zl1[23]` from https://monitor.qa.suse.de/alerting/silences (two silences)
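
To see which alert rules the silence matcher in the rollback step covers, here is a minimal Python sketch. It assumes the matcher is applied to the whole `rule_uid` value (fully anchored, as Alertmanager-style regex matchers usually are); that assumption has not been verified against this Grafana instance. The helper name `silenced` is hypothetical.

```python
import re

# Pattern copied from the rollback step.
SILENCE_PATTERN = r"(host_up|ping_time)_alert_s390zl1[23]"


def silenced(rule_uid: str) -> bool:
    """Return True if the entire rule_uid matches the silence pattern.

    Assumption: the silence matcher is anchored at both ends,
    which is modelled here with re.fullmatch.
    """
    return re.fullmatch(SILENCE_PATTERN, rule_uid) is not None


# The two rule_uid values that appear in the alert mail.
for uid in ("ping_time_alert_s390zl12", "openqa_ping_time_alert_s390zl12"):
    print(uid, silenced(uid))
```

Under that assumption the pattern covers `ping_time_alert_s390zl12` but not `openqa_ping_time_alert_s390zl12`, because of the `openqa_` prefix. With unanchored matching both would be covered.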
