action #98673

[retro] Unhandled alert about job queue for nearly a day, users brought it up in chat, should have been picked up sooner size:S

Added by okurz 2 months ago. Updated about 2 months ago.

Status: Blocked
Priority: Low
Assignee:
Target version:
Start date: 2021-09-20
Due date: 2022-02-04
% Done: 50%
Estimated time: (Total: 0.00 h)

Description

Observation

https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=9&from=1631611449633&to=1631662741852 shows that alerts for a long job queue were triggered at 2021-09-14 10:14:32Z with 3100 jobs blocked. Nobody reacted to the alert until I (okurz) sent multiple reminders, and we finally discussed it in more detail on 2021-09-15 09:00Z. I think we can do better than that.

Goals

  • We react to alerts before users bring them up in chat and "surprise" us
  • We proactively inform potentially impacted users (before they tell us)
  • We follow our alert handling process

Acceptance criteria

  • AC1 A team decision is documented

Suggestions

  • Create tickets (instead of alert emails)
  • Slack alerts (instead of alert emails)

Further details

Our alert handling process already describes what we should do; we just don't follow it in the time before the first users ask in chat what is going on.
Please also see
https://confluence.suse.com/display/~hrommel1/Communication+Plan+for+openQA+Outages which states requirements for communication. We would fulfill them by following the goals above, but so far we are not doing that very well.


Subtasks

action #98916: Improve alert handling - weekly alert duty (Resolved, cdywan)

action #98919: Improve alert handling - slack notifications (Feedback, nicksinger)


Related issues

Copied from QA - action #98667: Unhandled [Alerting] Queue: State (SUSE) alert for > 4h size:M (Resolved, 2021-09-15, 2021-09-29)

Copied to QA - action #98916: Improve alert handling - weekly alert duty (Resolved, 2021-10-04)

History

#1 Updated by okurz 2 months ago

  • Copied from action #98667: Unhandled [Alerting] Queue: State (SUSE) alert for > 4h size:M added

#2 Updated by cdywan 2 months ago

  • Subject changed from [retro] Unhandled alert about job queue for nearly a day, users brought it up in chat, should have been picked up sooner to [retro] Unhandled alert about job queue for nearly a day, users brought it up in chat, should have been picked up sooner size:S
  • Description updated (diff)
  • Status changed from New to Workable

#3 Updated by cdywan 2 months ago

  • Description updated (diff)

#4 Updated by Xiaojing_liu 2 months ago

I prefer alert emails

#5 Updated by cdywan 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to cdywan

Since we haven't come to a decision yet, I volunteer to be the first to try out an "alert duty" experiment. Going by the order of team members in the wiki, each week one person checks that alerts are handled. If nobody else is paying attention, this person makes sure we handle all alerts, as well as problems reported through other channels.

#6 Updated by okurz 2 months ago

  • Due date set to 2021-10-04

I hope others still react when they can. For the alert duty I consider it important to stay responsive on shorter time scales, e.g. if everybody else is "in the zone" for hours, happily coding, then the person on alert duty should proactively check whether there is something to react to.

#7 Updated by okurz 2 months ago

  • Copied to action #98916: Improve alert handling - weekly alert duty added

#8 Updated by okurz 2 months ago

I split the ticket into subtickets, as we already suggest Slack notifications there. nsinger and cdywan can continue in the specific subtickets.

#9 Updated by cdywan 2 months ago

  • Status changed from In Progress to Blocked

Setting to Blocked since this is waiting on subtasks. I also plan to bring this up in the retro this Friday.
