action #133892

Updated by mkittler over 1 year ago

## Observation 

 On Sat, 05 Aug 2023 19:55:08 +0200 we got an alert from Grafana:
 ``` 
 1 firing alert instance
 Grouped by: hostname=arm-worker2

 Firing [stats.openqa-monitor.qa.suse.de]
 arm-worker2: host up alert
 View alert [stats.openqa-monitor.qa.suse.de]

 Values:
   B0=1
 Labels:
   alertname=arm-worker2: host up alert
   grafana_folder=openQA
   hostname=arm-worker2
   rule_uid=host_up_alert_arm-worker2
   type=worker
 Annotations:
   message=No data received for pings from worker to central host, likely host is down (or split network). See https://progress.opensuse.org/issues/71098 for details
 ``` 

 https://stats.openqa-monitor.qa.suse.de/alerting/grafana/host_up_alert_arm-worker2/view?orgId=1 

 I can't see anything in the panel, though. 

 According to the alarm history, the rule has been in the alarm state since August 3, but the panel shows nothing for that period either.

 ## Problem 
 The machine "arm-worker2" no longer exists and is not supposed to exist. Like "d105" and similarly named machines, the name was only a temporary DHCP lease name. Apparently, alerts for machines that are not controlled by salt are no longer removed automatically.

 ## Acceptance criteria 
 * **AC1:** Alert rules for hosts that are not salt-controlled are removed automatically (or semi-automatically)
 * **AC2:** There are no firing alerts left for hosts that no longer exist

 ## Suggestions 
 * Find out where the alert rule https://stats.openqa-monitor.qa.suse.de/alerting/grafana/host_up_alert_arm-worker2/view?orgId=1 , which is still marked as "provisioned", comes from. It may be defined on monitor.qa.suse.de in /etc/telegraf/ or somewhere in gitlab.suse.de/openqa/salt-states-openqa
 * Try to find a way to automatically clean up alert rules that reference hosts which no longer exist, for example when a salt high state is applied in the pipeline of gitlab.suse.de/openqa/salt-states-openqa . If full automation is not possible, provide instructions for handling this manually in the gitlab.suse.de/openqa/salt-states-openqa#openqa-salt-states README
 * Clean up the current state 
 * Ensure that the corresponding alert rules are removed
 * Ensure that no related alerts are still firing
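
 The detection step behind the cleanup suggestion could be sketched as follows. This is only a minimal illustration under assumptions, not how salt-states-openqa works today: it assumes each alert rule is available as a dict with a `uid` and a `labels` mapping carrying `hostname` (as in the alert above), and that the list of salt-controlled hosts is known. The function name `stale_host_rules` and the sample host `worker-a` are hypothetical. In practice the rule list might come from Grafana's alert provisioning API or from the provisioned rule files, and the host list from the salt keys, but that wiring is left out here.

 ```python
 from typing import Iterable


 def stale_host_rules(rules: Iterable[dict], salt_hosts: Iterable[str]) -> list[str]:
     """Return UIDs of alert rules whose `hostname` label is not a salt-controlled host."""
     known = set(salt_hosts)
     stale = []
     for rule in rules:
         hostname = rule.get("labels", {}).get("hostname")
         # Rules without a hostname label are not host-specific, so they are left alone.
         if hostname is not None and hostname not in known:
             stale.append(rule["uid"])
     return sorted(stale)


 # Hypothetical sample data, shaped like the labels shown in the alert above.
 rules = [
     {"uid": "host_up_alert_arm-worker2", "labels": {"hostname": "arm-worker2", "type": "worker"}},
     {"uid": "host_up_alert_worker-a", "labels": {"hostname": "worker-a", "type": "worker"}},
     {"uid": "some_generic_alert", "labels": {}},
 ]
 salt_hosts = ["worker-a"]  # assumed to come from the accepted salt keys

 print(stale_host_rules(rules, salt_hosts))  # ['host_up_alert_arm-worker2']
 ```

 The sketch only reports candidates. Since the rule in question is marked as "provisioned", actually removing it most likely has to happen where it is provisioned rather than in the Grafana UI.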
