action #133130
Status: closed
Lots of alerts for a single cause. Can we group and de-duplicate? size:M
Description
Observation
Received the following alert emails:
- sapworker1: host up alert
- sapworker1: OpenQA Ping time alert
- sapworker2: host up alert
- ...
- sapworker3: OpenQA Ping time alert
- sapworker3: Ping time alert
- Average Ping time (ms) alert
all for a single root cause: a problem with the Frankencampus network. Can we group alerts and also avoid redundant "host up", "OpenQA Ping time" and "Ping time" alerts for the same incident?
Acceptance criteria
- AC1: Alerts are grouped (Grafana supports this!)
- AC2: No separate ping time alerts if there is a corresponding host up alert; at the very least, the ping time alert should fire much later than the host up alert
Suggestions
- Read https://grafana.com/docs/grafana/latest/alerting/manage-notifications/view-alert-groups/ and https://community.grafana.com/t/how-to-create-alert-group/71327
- Look into "grafana alert grouping" and configure alerts accordingly
- Crosscheck alerting time thresholds, e.g. pick sensible values for "host up" vs. "ping time" or "packet loss" (see the sketch after this list)
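A minimal sketch of how staggered pending periods could look in Grafana's alert rule file provisioning; the group name, folder, UIDs and concrete durations are illustrative assumptions, and the query definitions are omitted:

apiVersion: 1
groups:
  - orgId: 1
    name: host-checks              # illustrative group name
    folder: Infrastructure         # illustrative folder
    interval: 1m
    rules:
      - uid: host-up-sapworker1    # hypothetical UID
        title: "sapworker1: host up alert"
        condition: C
        for: 2m                    # "host up" should fire first
        labels:
          hostname: sapworker1
        # data: query definition omitted for brevity
      - uid: ping-time-sapworker1  # hypothetical UID
        title: "sapworker1: Ping time alert"
        condition: C
        for: 15m                   # ping time fires much later, see AC2
        labels:
          hostname: sapworker1
        # data: query definition omitted for brevity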
Rollback steps
- Remove the corresponding silences from https://monitor.qa.suse.de/alerting/silences, either those referencing this ticket or anything concerning "host up" or "ping time"
Updated by okurz over 1 year ago
- Copied from action #133127: Frankencampus network broken + GitlabCi failed --> uploading artefacts added
Updated by okurz over 1 year ago
- Related to action #132788: [alert][flaky] QA-Power8-5-kvm: QA network infrastructure Ping time alert added
Updated by okurz over 1 year ago
- Subject changed from Lots of alerts for a single cause. Can we group and de-duplicate? to Lots of alerts for a single cause. Can we group and de-duplicate? size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by nicksinger over 1 year ago
- Status changed from Workable to Feedback
I modified our "Notification policy" to group by "hostname". It waits 30s before sending out all alerts of a group, which might not be enough, but I don't know how I could test it. Setting the ticket to "Feedback" to just wait for now until something bigger breaks.
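For reference, a minimal sketch of what such a notification policy could look like when file-provisioned; the receiver name and all intervals except the 30s group wait are assumptions:

apiVersion: 1
policies:
  - orgId: 1
    receiver: osd-admins           # assumed receiver name
    group_by: ['hostname']         # bundle alerts per host
    group_wait: 30s                # wait 30s to collect all alerts of a group
    group_interval: 5m             # assumed
    repeat_interval: 4h            # assumed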
Updated by livdywan over 1 year ago
- Priority changed from Urgent to Normal
We discussed it a couple times now but unfortunately without updating the ticket. I'm lowering the urgency to reflect the fact that this is mostly done with the changes Nick already implemented.
Updated by nicksinger over 1 year ago
- Copied to action #133991: Cover same metric for different hosts with a single alert rule added
Updated by nicksinger over 1 year ago
- Copied to deleted (action #133991: Cover same metric for different hosts with a single alert rule)
Updated by nicksinger over 1 year ago
- Related to action #133991: Cover same metric for different hosts with a single alert rule added
Updated by nicksinger over 1 year ago
livdywan wrote:
We discussed it a couple times now but unfortunately without updating the ticket. I'm lowering the urgency to reflect the fact that this is mostly done with the changes Nick already implemented.
As mentioned we already group by hostname (which is an already present label in our setup). I've additionally created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/935 to introduce a common label for every instance of "host_up" alerts and created a follow-up #133991 to cover reducing the number of alert rule instances per metric.
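Conceptually, such a common label amounts to a fragment like the following in each "host_up" rule definition (a sketch, not the actual MR content; the hostname label just illustrates that the existing grouping key stays available):

        labels:
          alert: host_up           # common label shared by all "host_up" alert instances
          hostname: sapworker1     # per-host label used by the existing grouping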
Updated by nicksinger over 1 year ago
- Status changed from Feedback to Resolved
I've added another "nested policy" below our general __contacts__ =~ .*"osd-admins".* policy (which does the grouping by hostname). If the label of an alert is alert = host_up, it will group by alert. Not sure how this behaves in combination with the parent grouping, but IIUC this is an "overwrite" so it should not interfere.
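Assuming the policies are file-provisioned, such a nested route might look roughly like this below the parent policy (parent fields abbreviated):

    routes:
      - object_matchers:
          - ['alert', '=', 'host_up']   # only alerts carrying the common host_up label
        group_by: ['alert']             # overrides the parent's group_by: ['hostname']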
Resolving for now. If we see it not working out we can reopen it.
Updated by livdywan over 1 year ago
- Status changed from Resolved to Workable
- Assignee deleted (nicksinger)
We decided to re-open because we couldn't tell if this was alerting and maybe it "should have"?
Updated by livdywan about 1 year ago
- Copied to action #135470: Grafana: Average Ping time (ms) alert with unexpanded variable "${tag_url}", which machine is this about? size:M added
Updated by okurz about 1 year ago
- Target version changed from Ready to Tools - Next
Updated by okurz about 1 year ago
- Target version changed from Tools - Next to Ready
Updated by okurz about 1 year ago
- Related to action #138044: Grouped seemingly unrelated alert emails are confusing size:M added
Updated by okurz about 1 year ago
- Target version changed from Ready to Tools - Next
Updated by okurz about 1 year ago
- Target version changed from Tools - Next to Ready
Updated by okurz 11 months ago
- Status changed from New to Resolved
- Assignee set to nicksinger
In the daily infra call we tried to estimate this ticket again but we did not understand the comment:
We decided to re-open because we couldn't tell if this was alerting and maybe it "should have"?
Also we checked the mailing list and did not find many occurrences of grouping at all. Examples like https://mailman.suse.de/mlarch/SuSE/osd-admins/2023/osd-admins.2023.11/msg00120.html with subject "FIRING:2" don't look that helpful, but we wouldn't know what to do about it.
Obviously, after this long a time, if you see a related issue please rather open a new linked ticket with an explicit description instead of reopening this one.