action #135239

closed

Conduct lessons learned "Five Why" analysis for "OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert" size:M

Added by okurz over 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: Feature requests
Target version:
Start date: 2023-09-04
Due date: 2023-09-30
% Done: 0%
Estimated time:

Description

Observation

Questions

  1. Why was nobody in the team aware of the issue before user reports?
    • Because we did not have any alerts defined for "no jobs running" -> DONE: Define an alert for "way too few jobs, just a handful" (the first issue was that only one job was running) -> https://progress.opensuse.org/issues/135380#note-4
    • We have alerts for queue size and job age, but their thresholds had not been hit yet -> Consider lowering the threshold for job age and job queue alerts -> #136952
    • Also because our job age alert is actually wrongly defined -> #135008
  2. Why did we struggle to understand the underlying problem?
    • Because we did not see any error or warning in the logs regarding the scheduler not being able to schedule any jobs for longer spans of time
  3. Why weren't we aware that we have so many more workers?
    • Maybe not everybody was properly following the plan in the backlog regarding the setup of new workers? Yes, maybe, but we could not think of anything to improve there -> #135362
    • Likely having historic data regarding number of workers, offline, online, idle, in grafana would have helped to find the "last good" reference -> #136958
  4. Why did the scheduler not complain that it was blocked for too long or is too slow to finish within timeout?
    • There were no warnings because of the very long inactivity timeout of 10 minutes for HTTP requests from the scheduler to the websocket server -> look into shorter timeouts and a good error message -> #136961
    • Even if there were an error in the log, it would be in the journal and we would not be notified about it -> look up the ticket about logwarn for OSD -> #97544 #97247 #57239
  5. Why have we forgotten about #110833 "Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances"?
    • Because it's in "future" -> We have already planned to split weekly and retro and to reserve more time for the future outlook, in particular at the saga level, see #135023
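
The alert idea from question 1 ("way too few jobs, just a handful" while many jobs are scheduled) can be sketched as a simple condition. This is only a minimal illustration, not the alert actually defined in #135380: the `QueueSnapshot` structure, the function name and both threshold values are assumptions made for this example and are not taken from the OSD monitoring setup.

```python
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    """Job counts at one point in time (hypothetical structure)."""
    scheduled: int
    running: int


def too_few_jobs_running(snapshots, min_running=10, min_scheduled=100):
    """Return True if every snapshot shows a large backlog but few running jobs.

    Both thresholds are illustrative assumptions. Requiring the condition to
    hold for all snapshots in the window avoids alerting on a short dip.
    """
    if not snapshots:
        return False
    return all(
        s.running < min_running and s.scheduled >= min_scheduled
        for s in snapshots
    )


# Example: more than 3k scheduled jobs while only a handful are running
window = [
    QueueSnapshot(scheduled=3200, running=1),
    QueueSnapshot(scheduled=3350, running=2),
    QueueSnapshot(scheduled=3400, running=1),
]
print(too_few_jobs_running(window))
```

Unlike the queue size and job age alerts, which depend on a single value crossing a threshold, this condition relates the number of running jobs to the backlog, so it would also fire when the queue is still below its own alert threshold.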

Acceptance criteria

  • AC1: A Five-Whys analysis has been conducted and results documented
  • AC2: Improvements are planned

Suggestions

  • Bring up in retro
  • Conduct "Five-Whys" analysis for the topic
  • Identify follow-up tasks in tickets
  • Organize a call to conduct the 5 whys

Related issues 4 (3 open, 1 closed)

Copied from openQA Project (public) - coordination #135122: [epic] OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert (Resolved, okurz, 2023-09-07)

Copied to openQA Infrastructure (public) - action #136952: Consider lowering the threshold for job age and job queue alerts (New)

Copied to openQA Infrastructure (public) - action #136958: Add Grafana panel for number of workers (New)

Copied to openQA Project (public) - action #136961: Lower timeouts for HTTP requests from scheduler to websocket (New, 2023-09-25)

Actions #1

Updated by okurz over 1 year ago

  • Copied from coordination #135122: [epic] OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert added
Actions #2

Updated by livdywan over 1 year ago

  • Tracker changed from action to coordination
  • Status changed from New to Blocked
Actions #3

Updated by livdywan over 1 year ago

  • Tracker changed from coordination to action
  • Subject changed from Conduct lessons learned "Five Why" analysis for "OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert" to Conduct lessons learned "Five Why" analysis for "OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert" size:M
  • Description updated (diff)
  • Status changed from Blocked to Workable
Actions #4

Updated by okurz over 1 year ago

We conducted the lessons learned meeting. The results are in the ticket description.

Actions #5

Updated by okurz over 1 year ago

  • Description updated (diff)
Actions #6

Updated by livdywan over 1 year ago

  • Description updated (diff)
Actions #7

Updated by tinita over 1 year ago

  • Status changed from Workable to In Progress
  • Assignee set to tinita
Actions #8

Updated by openqa_review over 1 year ago

  • Due date set to 2023-09-30

Setting due date based on mean cycle time of SUSE QE Tools

Actions #9

Updated by tinita about 1 year ago

  • Description updated (diff)
Actions #10

Updated by tinita about 1 year ago

  • Copied to action #136952: Consider lowering the threshold for job age and job queue alerts added
Actions #11

Updated by tinita about 1 year ago

  • Description updated (diff)
Actions #12

Updated by tinita about 1 year ago

  • Copied to action #136958: Add Grafana panel for number of workers added
Actions #13

Updated by tinita about 1 year ago

  • Copied to action #136961: Lower timeouts for HTTP requests from scheduler to websocket added
Actions #14

Updated by tinita about 1 year ago

  • Description updated (diff)
Actions #15

Updated by tinita about 1 year ago

  • Description updated (diff)
Actions #16

Updated by tinita about 1 year ago

  • Status changed from In Progress to Resolved

Follow-up tickets have been created or existing tickets identified.
