action #125303

Updated by okurz almost 2 years ago

## Observation 
 We received firing/resolved mails for the alert on panel http://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker14?viewPanel=65105. The worker has been running since Feb 26 03:35:21, so unlike worker11/13 there was no crash. The alert fired with "DatasourceNoData", so this may have been just a temporary connection issue. Maybe this kind of alert should have been suppressed, but the suppression no longer works since we started migrating to the new alerting system?

 ## Acceptance criteria 
 * **AC1:** We receive no "no data" alert emails, same as before migrating to unified alerting in Grafana

 ## Suggestions 
 * Wait for #122845 
 * Try to reproduce the problem, e.g. just stop telegraf on worker11 (or worker14, the originally affected one) and see if we receive alerts 
 * Research how to configure alerts so that they do not notify when there is no data for a certain time, e.g. read upstream documentation, blog posts about the new unified alerting, etc.
 * Cross-check all our alert configs to ensure they preserve the behavior we had before the migration
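
As a starting point for the cross-check, here is a minimal sketch that lists rules which could still notify on missing data. It assumes the rules are available as a dict shaped like Grafana's alert rule provisioning format (`groups` containing `rules`, each with a `noDataState` field), and that an unset `noDataState` behaves like `NoData`. The function name and the sample group and rule names are hypothetical, not taken from our configuration.

```python
from typing import Iterator, Tuple

# Assumption: these noDataState values can lead to a notification
# when a query returns no data; "OK" would not.
NOTIFYING_STATES = {"NoData", "Alerting"}


def rules_notifying_on_no_data(provisioning: dict) -> Iterator[Tuple[str, str, str]]:
    """Yield (group name, rule title, state) for rules that may notify on no data."""
    for group in provisioning.get("groups", []):
        for rule in group.get("rules", []):
            # Assumption: a missing field is treated like "NoData".
            state = rule.get("noDataState", "NoData")
            if state in NOTIFYING_STATES:
                yield group.get("name", ""), rule.get("title", ""), state


# Hypothetical sample data, for illustration only.
sample = {
    "apiVersion": 1,
    "groups": [
        {
            "name": "hypothetical-workers",
            "rules": [
                {"title": "hypothetical worker14: host up alert", "noDataState": "NoData"},
                {"title": "hypothetical worker11: host up alert", "noDataState": "OK"},
            ],
        }
    ],
}

offenders = list(rules_notifying_on_no_data(sample))
for group_name, title, state in offenders:
    print(f"{group_name}: {title} -> noDataState={state}")
```

Whether `OK` or another state matches the behavior we had before the migration still needs to be confirmed against the upstream documentation.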

 ## Rollback steps 
 * Unsilence [openqa-piworker: host up alert](http://stats.openqa-monitor.qa.suse.de/alerting/silence/new?alertmanager=grafana&matcher=alertname%3DDatasourceNoData&matcher=datasource_uid%3D000000001&matcher=ref_id%3DA&matcher=rule_uid%3Dm3MU-u04k&matcher=rulename%3Dopenqa-piworker%3A+host+up+alert) 
 * Remove the silence for [all NoData alerts](https://stats.openqa-monitor.qa.suse.de/alerting/silence/ed350986-95b8-4eb7-afa0-97d6b92e55b0/edit?alertmanager=grafana)
