action #133130

closed

Lots of alerts for a single cause. Can we group and de-duplicate?

Added by okurz over 1 year ago. Updated 11 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Start date:
2023-07-20
Due date:
% Done:

0%

Estimated time:

Description

Observation

Received the following alert emails:

  • sapworker1: host up alert
  • sapworker1: OpenQA Ping time alert
  • sapworker2: host up alert
  • ...
  • sapworker3: OpenQA Ping time alert
  • sapworker3: Ping time alert
  • Average Ping time (ms) alert

all for a single reason: a problem with the Frankencampus network. Can we group alerts and also avoid getting separate host up, openQA ping time and ping time alerts?

Acceptance criteria

  • AC1: Grouped alerts (Grafana supports this)
  • AC2: No separate ping time alerts if there is a corresponding host up alert; at least the ping time alert should come much later than the host up alert
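To make AC2 more concrete, here is a minimal Python sketch of the intended de-duplication. It only illustrates the idea and is not how Grafana implements it. The label names "hostname" and "alert" and the value "host_up" appear later in this ticket; the values "openqa_ping_time" and "ping_time" are made up for the example.

```python
from typing import Dict, List

# Hypothetical alert representation: a plain dict of labels (sketch only).
Alert = Dict[str, str]


def suppress_redundant(alerts: List[Alert]) -> List[Alert]:
    """Drop ping time alerts for hosts that already have a host up alert."""
    # Hosts for which a host up alert is firing.
    down_hosts = {a["hostname"] for a in alerts if a["alert"] == "host_up"}
    # Keep every host up alert, and keep other alerts only for hosts
    # that are not already reported as down.
    return [
        a for a in alerts
        if a["alert"] == "host_up" or a["hostname"] not in down_hosts
    ]


alerts = [
    {"hostname": "sapworker1", "alert": "host_up"},
    {"hostname": "sapworker1", "alert": "openqa_ping_time"},  # assumed label value
    {"hostname": "sapworker3", "alert": "openqa_ping_time"},  # assumed label value
    {"hostname": "sapworker3", "alert": "ping_time"},         # assumed label value
]
remaining = suppress_redundant(alerts)
```

In this example the ping time alert for sapworker1 is dropped because a host up alert exists for the same host, while both sapworker3 alerts remain.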

Suggestions

Rollback steps


Related issues 5 (1 open, 4 closed)

Related to openQA Infrastructure (public) - action #132788: [alert][flaky] QA-Power8-5-kvm: QA network infrastructure Ping time alert (Resolved, okurz, 2023-07-15)

Related to openQA Infrastructure (public) - action #133991: Cover same metric for different hosts with a single alert rule (New, 2023-07-20)

Related to openQA Infrastructure (public) - action #138044: Grouped seemingly unrelated alert emails are confusing size:M (Rejected, okurz, 2023-10-09)

Copied from openQA Infrastructure (public) - action #133127: Frankencampus network broken + GitlabCi failed --> uploading artefacts (Resolved, okurz, 2023-07-20)

Copied to openQA Infrastructure (public) - action #135470: Grafana: Average Ping time (ms) alert with unexpanded variable "${tag_url}", which machine is this about? size:M (Resolved, mkittler, 2023-07-20, 2023-12-13)

Actions #1

Updated by okurz over 1 year ago

  • Copied from action #133127: Frankencampus network broken + GitlabCi failed --> uploading artefacts added
Actions #2

Updated by okurz over 1 year ago

  • Related to action #132788: [alert][flaky] QA-Power8-5-kvm: QA network infrastructure Ping time alert added
Actions #3

Updated by okurz over 1 year ago

  • Description updated (diff)
Actions #4

Updated by okurz over 1 year ago

  • Subject changed from Lots of alerts for a single cause. Can we group and de-duplicate? to Lots of alerts for a single cause. Can we group and de-duplicate? size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by nicksinger over 1 year ago

  • Assignee set to nicksinger
Actions #6

Updated by nicksinger over 1 year ago

  • Status changed from Workable to Feedback

I modified our "Notification policy" to group by "hostname". It waits 30s before sending out all alerts of a group, which might not be enough, but I don't know how I could test it. Setting to "Feedback" to just wait for now until something bigger breaks.
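For illustration only, the effect of grouping by "hostname" with a 30s wait can be modelled roughly as below. This is a simplified Python sketch under my own assumptions, not Grafana's actual behaviour: the first alert of a group opens a 30s window, everything arriving inside that window is sent together, and a later alert starts a new notification. The "alertname" key and the timestamps are made up for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

GROUP_WAIT_S = 30  # the 30s wait mentioned above

# An event is (arrival time in seconds, labels); a hypothetical shape for this sketch.
Event = Tuple[float, Dict[str, str]]


def batch_by_label(events: List[Event], label: str,
                   wait: float = GROUP_WAIT_S) -> Dict[str, List[List[str]]]:
    """Return, per label value, the alert names that would share one notification."""
    batches: Dict[str, List[List[str]]] = defaultdict(list)
    window_end: Dict[str, float] = {}  # label value -> end of the current window
    for ts, labels in sorted(events, key=lambda e: e[0]):
        key = labels.get(label, "")
        # Open a new batch if this group has no window yet or the window has passed.
        if key not in window_end or ts > window_end[key]:
            batches[key].append([])
            window_end[key] = ts + wait
        batches[key][-1].append(labels["alertname"])
    return dict(batches)


events = [
    (0, {"hostname": "sapworker1", "alertname": "host up"}),
    (10, {"hostname": "sapworker1", "alertname": "OpenQA Ping time"}),
    (5, {"hostname": "sapworker2", "alertname": "host up"}),
    (45, {"hostname": "sapworker3", "alertname": "Ping time"}),
    (0, {"hostname": "sapworker3", "alertname": "OpenQA Ping time"}),
]
result = batch_by_label(events, "hostname")
```

In this model the two sapworker1 alerts end up in one notification, while the sapworker3 alerts arrive 45s apart and are split into two, which is the case where 30s "might not be enough".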

Actions #7

Updated by livdywan over 1 year ago

  • Priority changed from Urgent to Normal

We discussed it a couple times now but unfortunately without updating the ticket. I'm lowering the urgency to reflect the fact that this is mostly done with the changes Nick already implemented.

Actions #8

Updated by nicksinger over 1 year ago

  • Copied to action #133991: Cover same metric for different hosts with a single alert rule added
Actions #9

Updated by nicksinger over 1 year ago

  • Copied to deleted (action #133991: Cover same metric for different hosts with a single alert rule)
Actions #10

Updated by nicksinger over 1 year ago

  • Related to action #133991: Cover same metric for different hosts with a single alert rule added
Actions #11

Updated by nicksinger over 1 year ago

livdywan wrote:

We discussed it a couple times now but unfortunately without updating the ticket. I'm lowering the urgency to reflect the fact that this is mostly done with the changes Nick already implemented.

As mentioned, we already group by hostname (which is an already present label in our setup). I've additionally created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/935 to introduce a common label for every instance of "host_up" alerts and created a follow-up #133991 to cover reducing the number of alert rule instances per metric.

Actions #12

Updated by nicksinger over 1 year ago

  • Status changed from Feedback to Resolved

I've added another "nested policy" below our general __contacts__ =~ .*"osd-admins".*-policy (which does the grouping by hostname). If the label of an alert is alert = host_up, it will group by alert. Not sure how this behaves in combination with the parent grouping but IIUC this is an "overwrite" so it should not interfere.
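A rough Python sketch of how I understand the resulting routing, assuming the nested policy really does override the parent grouping (that part is not verified). The function name and the tuple shape are made up for illustration; only the matchers and labels come from the policy described above.

```python
import re
from typing import Dict, Optional, Tuple

# Matcher of the parent policy: __contacts__ =~ .*"osd-admins".*
PARENT_MATCH = re.compile(r'.*"osd-admins".*')


def group_key(labels: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return the (label, value) pair an alert would be grouped by, or None."""
    if not PARENT_MATCH.fullmatch(labels.get("__contacts__", "")):
        return None  # not handled by this policy at all
    # Nested policy: alert = host_up groups by "alert" (assumed to override the parent).
    if labels.get("alert") == "host_up":
        return ("alert", "host_up")
    # Parent policy: everything else is grouped by hostname.
    return ("hostname", labels.get("hostname", ""))


up1 = {"__contacts__": '"osd-admins"', "alert": "host_up", "hostname": "sapworker1"}
up2 = {"__contacts__": '"osd-admins"', "alert": "host_up", "hostname": "sapworker2"}
ping = {"__contacts__": '"osd-admins"', "hostname": "sapworker3"}
other = {"__contacts__": '"someone-else"', "hostname": "sapworker3"}
```

With these inputs, `group_key(up1)` and `group_key(up2)` are equal, so host up alerts from different hosts land in one group, while `group_key(ping)` still groups by hostname.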

Resolving for now. If we see it not working out we can reopen it.

Actions #13

Updated by livdywan over 1 year ago

  • Status changed from Resolved to Workable
  • Assignee deleted (nicksinger)

We decided to re-open because we couldn't tell if this was alerting and maybe it "should have"?

Actions #14

Updated by livdywan about 1 year ago

  • Copied to action #135470: Grafana: Average Ping time (ms) alert with unexpanded variable "${tag_url}", which machine is this about? size:M added
Actions #15

Updated by okurz about 1 year ago

  • Target version changed from Ready to Tools - Next
Actions #16

Updated by okurz about 1 year ago

  • Target version changed from Tools - Next to Ready
Actions #17

Updated by okurz about 1 year ago

  • Related to action #138044: Grouped seemingly unrelated alert emails are confusing size:M added
Actions #18

Updated by okurz about 1 year ago

  • Target version changed from Ready to Tools - Next
Actions #19

Updated by okurz about 1 year ago

  • Target version changed from Tools - Next to Ready
Actions #20

Updated by livdywan 11 months ago

  • Subject changed from Lots of alerts for a single cause. Can we group and de-duplicate? size:M to Lots of alerts for a single cause. Can we group and de-duplicate?
  • Status changed from Workable to New

Maybe we should re-estimate this.

Actions #21

Updated by okurz 11 months ago

  • Status changed from New to Resolved
  • Assignee set to nicksinger

In the daily infra call we tried to estimate this ticket again but we did not understand the comment

We decided to re-open because we couldn't tell if this was alerting and maybe it "should have"?

Also, we checked the mailing list and did not find many occurrences of grouping at all. Examples like https://mailman.suse.de/mlarch/SuSE/osd-admins/2023/osd-admins.2023.11/msg00120.html with subject "FIRING:2" don't look that helpful, but we wouldn't know what to do about them.

Obviously, after this long time, if you see a related issue, please open a linked ticket with an explicit description rather than reopening this one.
