action #98673

closed

[retro] Unhandled alert about job queue for nearly a day, users brought it up in chat, should have been picked up sooner size:S

Added by okurz over 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date:
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

Observation

https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=9&from=1631611449633&to=1631662741852 shows that alerts were triggered for a long job queue at 2021-09-14 10:14:32Z with 3100 jobs blocked. Nobody reacted to the alert until I (okurz) had pointed it out multiple times, and we finally discussed it in more detail on 2021-09-15 09:00Z. I think we can do better than that.

Goals

  • We react to alerts before users bring them up in chat and "surprise" us
  • We proactively inform potentially impacted users (before they tell us)
  • We follow our alert handling process

Acceptance criteria

  • AC1 A team decision is documented

Suggestions

  • Create tickets (instead of alert emails)
  • Slack alerts (instead of alert emails)

Further details

Our alert handling process already describes what we should do. We just don't follow it quickly enough, so users end up asking in chat what is going on before we have reacted.
Please also see
https://confluence.suse.com/display/~hrommel1/Communication+Plan+for+openQA+Outages, which states requirements for communication. We would fulfill them by following the goals above, but so far we are not doing that very well.


Subtasks 1 (0 open, 1 closed)

action #98916: Improve alert handling - weekly alert duty (Resolved, livdywan)

Related issues 2 (0 open, 2 closed)

Copied from QA - action #98667: Unhandled [Alerting] Queue: State (SUSE) alert for > 4h size:M (Resolved, mkittler, 2021-09-15, 2021-09-29)
Copied to QA - action #98916: Improve alert handling - weekly alert duty (Resolved, livdywan)

Actions #1

Updated by okurz over 2 years ago

  • Copied from action #98667: Unhandled [Alerting] Queue: State (SUSE) alert for > 4h size:M added
Actions #2

Updated by livdywan over 2 years ago

  • Subject changed from [retro] Unhandled alert about job queue for nearly a day, users brought it up in chat, should have been picked up sooner to [retro] Unhandled alert about job queue for nearly a day, users brought it up in chat, should have been picked up sooner size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by livdywan over 2 years ago

  • Description updated (diff)
Actions #4

Updated by Xiaojing_liu over 2 years ago

I prefer alert emails

Actions #5

Updated by livdywan over 2 years ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan

Since we haven't come to a decision yet, I volunteer to be the first to try out an "alert duty" experiment. Going by the order of team members in the wiki, every week one person checks that alerts are handled. If nobody else is paying attention, this person makes sure we handle all alerts and any problems reported through other channels.
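
To illustrate the rotation, here is a minimal sketch of how the weekly alert duty could be computed. Everything in it is an assumption for illustration: the roster order, the start date of the first duty week, and the function name `alert_duty` are hypothetical, and the real order would come from the team wiki.

```python
from datetime import date
from typing import Sequence

# Hypothetical roster; the actual order is whatever the team wiki lists.
ROSTER = ["livdywan", "okurz", "mkittler", "Xiaojing_liu", "nsinger"]

# Assumed Monday on which the first duty week starts.
FIRST_DUTY_WEEK = date(2021, 9, 20)


def alert_duty(day: date,
               roster: Sequence[str] = ROSTER,
               start: date = FIRST_DUTY_WEEK) -> str:
    """Return who is on alert duty on the given day.

    Each person covers seven consecutive days, and the roster wraps
    around once everybody has had a turn.
    """
    weeks_elapsed = (day - start).days // 7
    return roster[weeks_elapsed % len(roster)]


if __name__ == "__main__":
    print(alert_duty(date.today()))
```

A fixed start date keeps the rotation stable across year boundaries, which a plain calendar week number would not.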

Actions #6

Updated by okurz over 2 years ago

  • Due date set to 2021-10-04

I hope others still react when they can. For the alert duty I consider it important to stay responsive on shorter time scales, e.g. if everybody else is "in the zone" for hours, happily coding, then the person on alert duty should proactively check whether there is something to react to.

Actions #7

Updated by okurz over 2 years ago

  • Copied to action #98916: Improve alert handling - weekly alert duty added
Actions #8

Updated by okurz over 2 years ago

I split the ticket into subtickets as we already suggest Slack notifications there. nsinger and cdywan can continue in the specific subtickets.

Actions #9

Updated by livdywan over 2 years ago

  • Status changed from In Progress to Blocked

Setting to Blocked since this is waiting on subtasks. I also plan to bring this up in the retro this Friday.

Actions #10

Updated by okurz about 2 years ago

  • Status changed from Blocked to Resolved

We moved #98919 to future and decided that we can go without it, so I removed #98919 as a subtask of this ticket. We can conclude here because we feel that the weekly alert duty is sufficient to achieve the goal.
