action #133130: Lots of alerts for a single cause. Can we group and de-duplicate? - openQA Infrastructure (public) - openSUSE Project Management Tool

action #133130

closed

Lots of alerts for a single cause. Can we group and de-duplicate?

Added by okurz almost 2 years ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee: nicksinger
Category:
Target version: openQA Project (public) - Ready
Start date: 2023-07-20
Due date:
% Done:
Estimated time:
Tags: alert, grafana, infra, improve

Description

Observation

Received the following alert emails:

sapworker1: host up alert
sapworker1: OpenQA Ping time alert
sapworker2: host up alert
...
sapworker3: OpenQA Ping time alert
sapworker3: Ping time alert
Average Ping time (ms) alert

All of these had a single cause: a problem with the Frankencampus network. Can we group alerts, and also avoid getting separate "host up", "OpenQA Ping time" and "Ping time" alerts for the same host?

Acceptance criteria

AC1: Alerts are grouped (Grafana supports this)
AC2: No separate ping time alerts if there is a corresponding host up alert; at the very least, the ping time alert should fire much later than the host up alert
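
The behaviour AC2 asks for can be illustrated in plain Python. This is only a sketch of the desired outcome, not Grafana configuration: the `Alert` class and the `suppress_redundant` helper are hypothetical names, and matching on the alert titles from the observation above is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    hostname: str
    name: str


def suppress_redundant(alerts):
    """Drop ping-time alerts for hosts that already have a host-up alert."""
    # Hosts for which a "host up alert" is currently firing.
    down_hosts = {a.hostname for a in alerts if a.name == "host up alert"}
    return [
        a for a in alerts
        if not (a.hostname in down_hosts and "ping time" in a.name.lower())
    ]


# Alert titles taken from the emails listed in the observation.
firing = [
    Alert("sapworker1", "host up alert"),
    Alert("sapworker1", "OpenQA Ping time alert"),
    Alert("sapworker3", "OpenQA Ping time alert"),
    Alert("sapworker3", "Ping time alert"),
]
remaining = suppress_redundant(firing)
```

In this example the ping time alert for sapworker1 is dropped because the host up alert already covers it, while both sapworker3 alerts remain because no host up alert is firing for that host.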

Suggestions

Read https://grafana.com/docs/grafana/latest/alerting/manage-notifications/view-alert-groups/ and https://community.grafana.com/t/how-to-create-alert-group/71327
Look into "grafana alert grouping" and configure alerts accordingly
Cross-check the alerting time thresholds, e.g. pick sensible values for "host up" vs. "ping time" or "packet loss".

Rollback steps

Remove the corresponding silences from https://monitor.qa.suse.de/alerting/silences, i.e. those that either reference this ticket or concern "host up" or "ping time"

Related issues 5 (1 open — 4 closed)

Related to openQA Infrastructure (public) - action #132788: [alert][flaky] QA-Power8-5-kvm: QA network infrastructure Ping time alert (Status: Resolved, Assignee: okurz, Start date: 2023-07-15)
Related to openQA Infrastructure (public) - action #133991: Cover same metric for different hosts with a single alert rule (Status: New, Start date: 2023-07-20)
Related to openQA Infrastructure (public) - action #138044: Grouped seemingly unrelated alert emails are confusing size:M (Status: Rejected, Assignee: okurz, Start date: 2023-10-09)
Copied from openQA Infrastructure (public) - action #133127: Frankencampus network broken + GitlabCi failed --> uploading artefacts (Status: Resolved, Assignee: okurz, Start date: 2023-07-20)
Copied to openQA Infrastructure (public) - action #135470: Grafana: Average Ping time (ms) alert with unexpanded variable "${tag_url}", which machine is this about? size:M (Status: Resolved, Assignee: mkittler, Start date: 2023-07-20, Due date: 2023-12-13)

Updated by okurz almost 2 years ago

Copied from action #133127: Frankencampus network broken + GitlabCi failed --> uploading artefacts added

Updated by okurz almost 2 years ago

Related to action #132788: [alert][flaky] QA-Power8-5-kvm: QA network infrastructure Ping time alert added

Updated by okurz almost 2 years ago

Description updated (diff)

Updated by okurz almost 2 years ago

Subject changed from Lots of alerts for a single cause. Can we group and de-duplicate? to Lots of alerts for a single cause. Can we group and de-duplicate? size:M
Description updated (diff)
Status changed from New to Workable

Updated by nicksinger almost 2 years ago

Assignee set to nicksinger

Updated by nicksinger almost 2 years ago

Status changed from Workable to Feedback

I modified our "Notification policy" to group by "hostname". It waits for 30s before sending out all alerts of a group, which might not be enough, but I don't know how I could test it. Setting to "Feedback" for now to wait until something bigger breaks.
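
To make the effect of the 30s wait easier to reason about, here is a minimal Python simulation of grouping by one label with a fixed wait. It is a simplified model, not Grafana's implementation: `group_notifications`, the event format and the `alertname` label are assumptions made for the example, and follow-up notifications for late alerts are not modelled.

```python
from collections import defaultdict

GROUP_WAIT_SECONDS = 30  # the wait mentioned in the comment above


def group_notifications(events, group_by="hostname", group_wait=GROUP_WAIT_SECONDS):
    """Batch alerts per group value.

    events: list of (timestamp_seconds, labels) tuples sorted by timestamp.
    Returns (notifications, late), where notifications is a list of
    (send_time, group_value, alert_names) and late holds the events that
    arrived after their group's first notification had been sent.
    """
    first_seen = {}
    batches = defaultdict(list)
    late = []
    for ts, labels in events:
        key = labels[group_by]
        start = first_seen.setdefault(key, ts)
        if ts - start <= group_wait:
            batches[key].append(labels["alertname"])
        else:
            late.append((ts, labels))
    notifications = [
        (first_seen[key] + group_wait, key, names) for key, names in batches.items()
    ]
    return notifications, late


events = [
    (0, {"hostname": "sapworker1", "alertname": "host up alert"}),
    (10, {"hostname": "sapworker1", "alertname": "OpenQA Ping time alert"}),
    (12, {"hostname": "sapworker2", "alertname": "host up alert"}),
    (45, {"hostname": "sapworker1", "alertname": "Ping time alert"}),
]
notifications, late = group_notifications(events)
```

With these made-up timestamps, the alert arriving at 45s misses the first sapworker1 notification, which is the situation in which a 30s wait "might not be enough".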

Updated by livdywan almost 2 years ago

Priority changed from Urgent to Normal

We discussed it a couple times now but unfortunately without updating the ticket. I'm lowering the urgency to reflect the fact that this is mostly done with the changes Nick already implemented.

Updated by nicksinger almost 2 years ago

Copied to action #133991: Cover same metric for different hosts with a single alert rule added

Updated by nicksinger almost 2 years ago

Copied to deleted (action #133991: Cover same metric for different hosts with a single alert rule)

#10

Updated by nicksinger almost 2 years ago

Related to action #133991: Cover same metric for different hosts with a single alert rule added

#11

Updated by nicksinger almost 2 years ago

livdywan wrote:

We discussed it a couple times now but unfortunately without updating the ticket. I'm lowering the urgency to reflect the fact that this is mostly done with the changes Nick already implemented.

As mentioned, we already group by hostname (which is an already present label in our setup). I've additionally created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/935 to introduce a common label for every instance of "host_up" alerts, and created a follow-up #133991 to cover reducing the number of alert rule instances per metric.

#12

Updated by nicksinger almost 2 years ago

Status changed from Feedback to Resolved

I've added another "nested policy" below our general __contacts__ =~ .*"osd-admins".* policy (which does the grouping by hostname). If an alert carries the label alert = host_up, it will be grouped by alert. I'm not sure how this behaves in combination with the parent grouping, but IIUC this is an "override", so it should not interfere.
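
The assumed "override" behaviour can be sketched as a small policy tree in Python. This only models the reading given above (the deepest matching policy decides the grouping); whether Grafana behaves exactly this way is not confirmed in this ticket. `Policy` and `resolve_group_key` are hypothetical names, and the matchers are treated as fully anchored regular expressions.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Policy:
    label: str      # label the matcher looks at
    pattern: str    # regular expression the label value must fully match
    group_by: tuple
    children: list = field(default_factory=list)

    def matches(self, labels):
        value = labels.get(self.label)
        return value is not None and re.fullmatch(self.pattern, value) is not None


def resolve_group_key(policy, labels):
    """Return the grouping key of the deepest matching policy, or None."""
    if not policy.matches(labels):
        return None
    for child in policy.children:
        key = resolve_group_key(child, labels)
        if key is not None:
            return key  # assumption: a matching child overrides the parent
    return tuple((name, labels.get(name)) for name in policy.group_by)


# Nested policy: alert = host_up, grouped by "alert".
host_up_policy = Policy("alert", "host_up", ("alert",))
# Parent policy: __contacts__ =~ .*"osd-admins".*, grouped by "hostname".
osd_admins_policy = Policy(
    "__contacts__", '.*"osd-admins".*', ("hostname",), [host_up_policy]
)

contacts = '"osd-admins"'
key_a = resolve_group_key(
    osd_admins_policy,
    {"__contacts__": contacts, "hostname": "sapworker1", "alert": "host_up"},
)
key_b = resolve_group_key(
    osd_admins_policy,
    {"__contacts__": contacts, "hostname": "sapworker2", "alert": "host_up"},
)
key_c = resolve_group_key(
    osd_admins_policy,
    {"__contacts__": contacts, "hostname": "sapworker1"},
)
```

Under this model, host_up alerts from different hosts share one group (`key_a` equals `key_b`), while an alert without the `alert = host_up` label still falls back to per-hostname grouping (`key_c`).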

Resolving for now. If we see it not working out we can reopen it.

#13

Updated by livdywan almost 2 years ago

Status changed from Resolved to Workable
Assignee deleted (~~nicksinger~~)

We decided to re-open because we couldn't tell if this was alerting and maybe it "should have"?

#14

Updated by livdywan over 1 year ago

Copied to action #135470: Grafana: Average Ping time (ms) alert with unexpanded variable "${tag_url}", which machine is this about? size:M added

#15

Updated by okurz over 1 year ago

Target version changed from Ready to Tools - Next

#16

Updated by okurz over 1 year ago

Target version changed from Tools - Next to Ready

#17

Updated by okurz over 1 year ago

Related to action #138044: Grouped seemingly unrelated alert emails are confusing size:M added

#18

Updated by okurz over 1 year ago

Target version changed from Ready to Tools - Next

#19

Updated by okurz over 1 year ago

Target version changed from Tools - Next to Ready

#20

Updated by livdywan over 1 year ago

Subject changed from Lots of alerts for a single cause. Can we group and de-duplicate? size:M to Lots of alerts for a single cause. Can we group and de-duplicate?
Status changed from Workable to New

Maybe we should re-estimate this.

#21

Updated by okurz over 1 year ago

Status changed from New to Resolved
Assignee set to nicksinger

In the daily infra call we tried to estimate this ticket again, but we did not understand the comment:

We decided to re-open because we couldn't tell if this was alerting and maybe it "should have"?

We also checked the mailing list and did not find many occurrences of grouping at all. Examples like https://mailman.suse.de/mlarch/SuSE/osd-admins/2023/osd-admins.2023.11/msg00120.html with subject "[FIRING:2] (Salt)" don't look that helpful, but we wouldn't know what to do about them.

After this long a time, if you see a related issue, please open a linked ticket with an explicit description rather than reopening this one.
