action #125303
prevent confusing "no data" alerts size:M
0%
Description
Observation
We used "no data" as the alert trigger for our "host up" alerts. This caused confusion after switching to the new unified alerting system in grafana: we assumed that telegraf had provided no data, while in reality the alert was valid.
Acceptance criteria
- AC1: We don't rely on "no data"-triggers for other purposes (e.g. host up)
Suggestions
- Wait for a Grafana 9.1 update so we can provision alerts from files
- Change the "host up"-alert from using "average_response_ms" to "result_code"
- Cross-check whether we already have a solution for telegraf not being able to push data to influxdb
History
#3
Updated by okurz 3 months ago
- Subject changed from openqaworker14: host up alert firing from 11:50 to 14:30 on 02.03.23 to ensure no "no data" alerts after migrating to unified alerting in grafana (was: openqaworker14: host up alert firing from 11:50 to 14:30 on 02.03.23)
- Description updated
- Status changed from New to Blocked
- Assignee set to nicksinger
nicksinger, please track this ticket as blocked by #122845, which you are working on right now.
#4
Updated by okurz 3 months ago
- Related to action #122845: Migrate our Grafana setup to "unified alerting" added
#7
Updated by mkittler 3 months ago
- Subject changed from ensure no "no data" alerts after migrating to unified alerting in grafana (was: openqaworker14: host up alert firing from 11:50 to 14:30 on 02.03.23) to ensure no "no data" alerts after migrating to unified alerting in grafana (was: openqaworker14: host up alert firing from 11:50 to 14:30 on 02.03.23) size:M
- Status changed from New to Workable
#10
Updated by okurz 3 months ago
Discussed with nicksinger. The general silencing does not make sense, as the "host up" alert was actually the only real alert where we would care about data because "average_response_ms" never returns a value if there is no response. However, it looks like we never challenged that query design, which has been in place since 2019 (salt-states-openqa commit 5ae5356). So I deleted the generic silence again. Also, openqa-piworker just reappeared after dheidler fixed the network config, so I also unsilenced the specific alert about openqa-piworker.
Looking through my email archive of the past days, I could only find "FIRING.DatasourceNoData.*host up alert" messages, which are the good ones. So it seems we never had an unintended message about NoData. Still, we can try to improve the alert by switching to checking "result_code", which apparently yields 0 for a successful ping response and 1 otherwise. We changed the alert, but as we already know, the alert configuration is not saved in the exported JSON file. So that is something to continue … or manually change all ping alerts to use "last of packets_received, alert if max is below 1".
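To illustrate the difference between the two proposed conditions and a plain "no data" trigger, here is a minimal Python sketch. It is not the Grafana rule itself: the function names and the three result labels are hypothetical, and the meaning of "result_code" (0 for a successful ping, 1 otherwise) is taken from the observation above rather than verified against telegraf.

```python
from typing import Optional, Sequence


def _present(samples: Sequence[Optional[int]]) -> list:
    """Drop missing points (None) from a window of samples."""
    return [s for s in samples if s is not None]


def host_up_from_result_code(samples: Sequence[Optional[int]]) -> str:
    """Evaluate a window of "result_code" values.

    Assumption: 0 means the ping succeeded, anything else means it failed.
    """
    values = _present(samples)
    if not values:
        # Nothing arrived at all: a monitoring problem, not a host-down signal.
        return "no_data"
    return "ok" if values[-1] == 0 else "alerting"


def host_up_from_packets_received(samples: Sequence[Optional[int]]) -> str:
    """Evaluate per-interval last values of "packets_received".

    Mirrors "alert if max is below 1": the host counts as down only if
    no interval in the window saw a single reply.
    """
    values = _present(samples)
    if not values:
        return "no_data"
    return "alerting" if max(values) < 1 else "ok"
```

With either variant a host that stops answering produces an explicit "alerting" state, so "no_data" is left to mean only that telegraf delivered nothing.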
#11
Updated by nicksinger 3 months ago
- Subject changed from ensure no "no data" alerts after migrating to unified alerting in grafana (was: openqaworker14: host up alert firing from 11:50 to 14:30 on 02.03.23) size:M to prevent confusing "no data" alerts size:M
- Description updated
- Status changed from Workable to Blocked
- Priority changed from High to Low
#12
Updated by nicksinger 3 months ago
- Blocked by action #125642: Manage "unified alerting" via salt size:M added
#14
Updated by okurz 3 months ago
grafana 9.3.6 was built in https://build.opensuse.org/package/show/server:monitoring/grafana but is not yet published, so we can monitor http://download.opensuse.org/repositories/server:/monitoring/15.4/x86_64/?P=grafana* and upgrade as soon as it is published.
#15
Updated by cdywan 3 months ago
- Status changed from New to In Progress
- Assignee changed from nicksinger to cdywan
We decided to roll back the alerts for now. This can be done by adjusting the config file, so I'll take care of that step.
#16
Updated by cdywan 3 months ago
cdywan wrote:
We decided to roll back the alerts for now. This can be done by adjusting the config file, so I'll take care of that step.
For the record: the way we write the config file, it would be fully overridden, which is why I decided against side-stepping salt and prepared an MR for it.
#17
Updated by openqa_review 3 months ago
- Due date set to 2023-03-28
Setting due date based on mean cycle time of SUSE QE Tools
#19
Updated by okurz 3 months ago
The rollback apparently had the effect that the alert list on grafana no longer shows "unified alerting" alerts and that we receive emails in the old format again, with subjects like "[Alerting]" and "[Ok]" instead of "[FIRING]" and "[RESOLVED]". However, https://monitor.qa.suse.de/alerting/list shows the error "Failed to load Grafana rules state: 404 from rule state endpoint. Perhaps ruler API is not enabled". That might also come from the automatic upgrade to grafana 9.3.6, which happened overnight. I suggest doing a web search for that error message, enabling a simple boolean option in the grafana config and restarting the service, which hopefully fixes it.
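As a quick way to check which alerting system produced a given notification, the subject prefixes mentioned above can be matched in a mail archive. This is only a sketch: the function name is hypothetical, and it assumes the prefix appears at the very start of the subject, possibly followed by a count as in "[FIRING:1]".

```python
def alert_mail_format(subject: str) -> str:
    """Guess the alerting system from a notification subject line."""
    s = subject.lstrip()
    # Unified alerting subjects start with "[FIRING" or "[RESOLVED".
    if s.startswith(("[FIRING", "[RESOLVED")):
        return "unified"
    # Legacy alerting subjects start with "[Alerting]" or "[Ok]".
    if s.startswith(("[Alerting]", "[Ok]")):
        return "legacy"
    return "unknown"
```

Subjects that match neither pattern are reported as "unknown" rather than guessed.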
#21
Updated by nicksinger 2 months ago
- Assignee set to nicksinger
#22
Updated by nicksinger 2 months ago
- Status changed from Workable to In Progress
#24
Updated by cdywan about 2 months ago
- Status changed from In Progress to Resolved
- AC1: We don't rely on "no data"-triggers for other purposes (e.g. host up, etc)
I double-checked in the team chat. We still see "no data" alerts, e.g. today there was one for the "Queue: State (SUSE) alert", but none for the availability of hosts. So the AC is fulfilled, and we can come up with follow-up tickets as needed as part of the regular alert handling.