action #133130

closed

Lots of alerts for a single cause. Can we group and de-duplicate?

Added by okurz about 1 year ago. Updated 7 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2023-07-20
Due date:
% Done: 0%
Estimated time:

Description

Observation

Received the following alert emails:

  • sapworker1: host up alert
  • sapworker1: OpenQA Ping time alert
  • sapworker2: host up alert
  • ...
  • sapworker3: OpenQA Ping time alert
  • sapworker3: Ping time alert
  • Average Ping time (ms) alert

All of these had a single cause: a problem with the Frankencampus network. Can we group alerts, and also avoid receiving separate host up, OpenQA Ping time, and Ping time alerts for the same host?

Acceptance criteria

  • AC1: Alerts are grouped; Grafana supports this.
  • AC2: No separate ping time alerts are sent if there is a corresponding host up alert; at the very least, the ping time alert should fire much later than the host up alert.
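
To make the two criteria concrete, here is a minimal Python sketch of the intended behaviour. It is not Grafana configuration and not how the ticket was resolved. The alert tuples, the function names `suppress_ping_alerts` and `group_by_rule`, and the substring match on "ping time" are all assumptions made for illustration; the host and rule names are taken from the emails listed in the observation.

```python
from collections import defaultdict

# Hypothetical alert records as (host, rule name) pairs, mirroring the emails above.
ALERTS = [
    ("sapworker1", "host up"),
    ("sapworker1", "OpenQA Ping time"),
    ("sapworker2", "host up"),
    ("sapworker3", "OpenQA Ping time"),
    ("sapworker3", "Ping time"),
]


def suppress_ping_alerts(alerts):
    """AC2 (assumed logic): drop ping time alerts for hosts that already have a host up alert."""
    down_hosts = {host for host, rule in alerts if rule == "host up"}
    return [
        (host, rule)
        for host, rule in alerts
        if not (host in down_hosts and "ping time" in rule.lower())
    ]


def group_by_rule(alerts):
    """AC1 (assumed logic): group alerts by rule name so one notification covers many hosts."""
    groups = defaultdict(list)
    for host, rule in alerts:
        groups[rule].append(host)
    return dict(groups)


if __name__ == "__main__":
    for rule, hosts in group_by_rule(suppress_ping_alerts(ALERTS)).items():
        print(f"{rule}: {', '.join(hosts)}")
```

With the sample data, the five alerts collapse to three notifications, and the sapworker1 ping time alert is dropped because a host up alert already covers that host. In Grafana itself the equivalent would presumably be configured through alert grouping in the notification settings rather than in code.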

Suggestions

Rollback steps


Related issues 5 (1 open, 4 closed)

  • Related to openQA Infrastructure - action #132788: [alert][flaky] QA-Power8-5-kvm: QA network infrastructure Ping time alert (Resolved, okurz, 2023-07-15)
  • Related to openQA Infrastructure - action #133991: Cover same metric for different hosts with a single alert rule (New, 2023-07-20)
  • Related to openQA Infrastructure - action #138044: Grouped seemingly unrelated alert emails are confusing size:M (Rejected, okurz, 2023-10-09)
  • Copied from openQA Infrastructure - action #133127: Frankencampus network broken + GitlabCi failed --> uploading artefacts (Resolved, okurz, 2023-07-20)
  • Copied to openQA Infrastructure - action #135470: Grafana: Average Ping time (ms) alert with unexpanded variable "${tag_url}", which machine is this about? size:M (Resolved, mkittler, 2023-07-20, 2023-12-13)
