action #118891
Make alerts depend on each other
Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
QA (public, currently private due to #173521) - future
Start date:
2022-10-14
Due date:
% Done:
0%
Estimated time:
Description
Observation
Our alerts operate on different levels of a system, from general checks (a host is "up", the network is reachable) up to checks on the output of services (e.g. worker minions). If a machine is down, its services are down too, so a single outage results in a lot of mails/alerts. We should introduce a way to disable the more sophisticated checks if the basic ones already fail.
Acceptance criteria
- AC1: Offline machines create just a single alert/e-mail
- AC1.1: We get reminded of, or have an overview of, the current status
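AC1 can be read as a dependency rule: once a basic check fails for a host, the dependent checks for that host add no information. The following is a minimal sketch of that rule only, not of anything in our Grafana setup. The `Alert` record, its fields, and the host and alert names in the usage note are hypothetical and chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    host: str    # machine the alert belongs to (hypothetical field)
    name: str    # alert name, e.g. "host up" or "worker minion"
    basic: bool  # True for basic checks such as "host up"


def alerts_to_notify(firing):
    """Reduce firing alerts to one per host when a basic check fails."""
    by_host = defaultdict(list)
    for alert in firing:
        by_host[alert.host].append(alert)

    result = []
    for alerts in by_host.values():
        basic = [a for a in alerts if a.basic]
        # If any basic check fails, report only the first one and drop
        # everything else for this host; otherwise keep all alerts.
        result.extend(basic[:1] if basic else alerts)
    return result
```

With a "host up" alert and a "worker minion" alert both firing for the same machine, only "host up" would be reported. A "worker minion" alert on a machine that is otherwise healthy would still go through.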
Suggestions
- Make use of the "automatic actions pipeline/dashboard" (https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions) to trigger pausing alerts via the HTTP API (https://grafana.com/docs/grafana/v9.0/developers/http_api/alerting/#pause-alert-by-id)
- Create a pipeline in GitLab which pauses alerts and can be triggered by a webhook
- Create a Grafana alert channel similar to the automatic arm reboot (see https://stats.openqa-monitor.qa.suse.de/alerting/notification/3/edit)
- Create a corresponding panel with the webhook alert channel configured to trigger if a machine is down
- Alternatively: Check whether we can make use of Grafana 9 and its new unified alerting (https://grafana.com/blog/2022/06/14/grafana-alerting-explore-our-latest-updates-in-grafana-9/). This automatically groups alerts, sends reminders and might even have more features to make alerts depend on each other
- Possibly a combination of both (e.g. make a specific group of alerts, selected by tags, trigger a specific pipeline that silences the whole group/tag)
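As an illustration of the "pause alert by id" call referenced above, here is a minimal sketch of how a pipeline job might build the request. It assumes the legacy alerting endpoint `POST /api/alerts/:id/pause` from the linked v9.0 documentation and a Grafana API token sent as a bearer token. The function name, base URL, alert id, and token are placeholders, and the endpoint would most likely not apply once alerts are migrated to unified alerting.

```python
import json
import urllib.request


def build_pause_request(base_url, alert_id, token, paused=True):
    """Build (but do not send) a request that pauses or unpauses an alert."""
    url = f"{base_url.rstrip('/')}/api/alerts/{alert_id}/pause"
    body = json.dumps({"paused": paused}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # assumed: API token auth
            "Content-Type": "application/json",
        },
    )


# Sending is left to the caller, for example:
#   with urllib.request.urlopen(build_pause_request(url, 42, token)) as resp:
#       print(resp.status)
```

Keeping request construction separate from sending makes it possible to check the URL and payload in a pipeline dry run before touching the production Grafana instance. Passing `paused=False` builds the matching request to resume the alert once the machine is back.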
Updated by okurz about 2 years ago
- Related to action #118375: Do not alert about "packet loss" if hosts are down added