action #125303

Updated by nicksinger about 1 year ago

## Observation 
We used "no data" as the alert trigger for our "host up" alerts. This caused confusion after switching to the new unified alerting system in Grafana: we received firing/resolved mails for the alert on panel http://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker14?viewPanel=65105 and thought that no data was provided by telegraf, while in reality it was a valid alert. The worker has been running since Feb 26 03:35:21, so unlike worker11/13 there was no crash. The alert was firing with "DatasourceNoData", so maybe there was just a temporary connection issue. Maybe this kind of alert should actually have been suppressed, but this doesn't work anymore since we've been migrating to the new alerting system?

 ## Acceptance criteria 
* **AC1:** We don't rely on "no data" triggers for other purposes (e.g. host up, etc.), and we don't receive "no data" alert emails, same as before migrating to unified alerting in Grafana

 ## Suggestions 
* Wait for a Grafana 9.1 update so we can provision alerts from files (see #122845)
* Try to reproduce the problem, e.g. just stop telegraf on worker11 (or worker14, the originally affected one) and see if we receive alerts
* Change the "host up" alert from using "average_response_ms" to "result_code"
* Research how to configure alerts accordingly to not notify if there is no data for a certain time, e.g. read upstream documentation, blog posts about new unified alerting, etc.
* Crosscheck if we already have a solution for telegraf not being able to push data to influxdb
* Crosscheck all our alert configs so that we ensure what we had in the past
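
The idea behind moving the "host up" alert to "result_code" can be sketched roughly as follows. This is only an illustration, not our Grafana configuration: `HostState` and `classify_host` are hypothetical names, and the sketch assumes that a `result_code` of 0 means the ping succeeded while any other value is a failure. The point is that "host down" and "no data" become two separate outcomes instead of "no data" standing in for "host down".

```python
from enum import Enum
from typing import Optional, Sequence


class HostState(Enum):
    UP = "up"
    DOWN = "down"
    NO_DATA = "no_data"


def classify_host(result_codes: Sequence[Optional[int]]) -> HostState:
    """Classify a host from the result_code samples of one evaluation window.

    None stands for a missing sample (nothing arrived for that interval).
    """
    seen = [code for code in result_codes if code is not None]
    if not seen:
        # Nothing arrived at all: this points at the data pipeline
        # (e.g. telegraf or the connection), not necessarily at the host.
        return HostState.NO_DATA
    # Assumption: 0 means the ping succeeded, anything else is a failure.
    if seen[-1] == 0:
        return HostState.UP
    return HostState.DOWN
```

With this split, only `HostState.DOWN` would need to notify as a "host up" alert, while `HostState.NO_DATA` could be handled (or silenced) separately.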

 ## Rollback steps 
 * Unsilence [openqa-piworker: host up alert](http://stats.openqa-monitor.qa.suse.de/alerting/silence/new?alertmanager=grafana&matcher=alertname%3DDatasourceNoData&matcher=datasource_uid%3D000000001&matcher=ref_id%3DA&matcher=rule_uid%3Dm3MU-u04k&matcher=rulename%3Dopenqa-piworker%3A+host+up+alert) 
 * Remove the silence for [all NoData alerts](https://stats.openqa-monitor.qa.suse.de/alerting/silence/ed350986-95b8-4eb7-afa0-97d6b92e55b0/edit?alertmanager=grafana)
