action #98673

closed

[retro] Unhandled alert about job queue for nearly a day, users brought it up in chat, should have been picked up sooner size:S

Added by okurz over 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Start date:
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

Observation

https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=9&from=1631611449633&to=1631662741852 shows that alerts were triggered for a long job queue at 2021-09-14 10:14:32Z with 3100 jobs blocked. Nobody reacted to the alert until I (okurz) sent multiple reminders, and we finally discussed it in more detail on 2021-09-15 09:00Z. I think we can do better than that.

Goals

  • We react to alerts before users bring them up in chat and "surprise" us
  • We proactively inform potentially impacted users (before they tell us)
  • We follow our alert handling process

Acceptance criteria

  • AC1 A team decision is documented

Suggestions

  • Create tickets (instead of alert emails)
  • Slack alerts (instead of alert emails)
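
As a rough illustration of the "create tickets" suggestion, the following is a minimal sketch of how an alert could be turned into the JSON body for Redmine's issue creation endpoint (POST /issues.json). The function name build_alert_ticket and the project_id value are hypothetical, the alert data is taken from the observation above, and actually sending the request (with an X-Redmine-API-Key header) is left out. This is not how the team implemented it, only one way it might look.

```python
import json
from datetime import datetime, timezone


def build_alert_ticket(alert_name, triggered_at, blocked_jobs, project_id):
    """Build the JSON body for Redmine's POST /issues.json from one alert.

    Hypothetical helper: only the payload is built, nothing is sent.
    """
    subject = f"[alert] {alert_name}"
    description = (
        f"Alert triggered at {triggered_at.strftime('%Y-%m-%d %H:%M:%SZ')} "
        f"with {blocked_jobs} jobs blocked."
    )
    return {
        "issue": {
            "project_id": project_id,
            "subject": subject,
            "description": description,
        }
    }


payload = build_alert_ticket(
    alert_name="Queue: State (SUSE)",
    triggered_at=datetime(2021, 9, 14, 10, 14, 32, tzinfo=timezone.utc),
    blocked_jobs=3100,
    project_id=1,  # hypothetical project id, adjust to the real tracker
)
print(json.dumps(payload, indent=2))
```

A ticket created this way would show up in the backlog with a status and an assignee, which an alert email does not have.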

Further details

Our alert handling process already describes what we should do; we just don't follow it in time, before the first users ask in chat what is going on.
Please also see
https://confluence.suse.com/display/~hrommel1/Communication+Plan+for+openQA+Outages which states requirements for communication that we would fulfill by following the goals above, but so far we are not doing this very well.


Subtasks 1 (0 open, 1 closed)

action #98916: Improve alert handling - weekly alert duty, Resolved, livdywan


Related issues 2 (0 open, 2 closed)

Copied from QA (public) - action #98667: Unhandled [Alerting] Queue: State (SUSE) alert for > 4h size:M, Resolved, mkittler, 2021-09-15, 2021-09-29

Copied to QA (public) - action #98916: Improve alert handling - weekly alert duty, Resolved, livdywan
