action #135239
closed
Conduct lessons learned "Five Why" analysis for "OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert" size:M
Start date: 2023-09-04
Due date: 2023-09-30
% Done: 0%
Estimated time:
Description
Observation
- User reports on September 4 that only a few jobs were running
Questions
- Why was nobody in the team aware of the issue before user reports?
  - Because we don't have any alerts defined for "no jobs running" -> DONE Define an alert for "way too few jobs, just a handful" (the first issue was only one job running) -> https://progress.opensuse.org/issues/135380#note-4
  - We have alerts for queue size and job age but we had not hit the threshold yet -> Consider lowering the threshold for the job age and job queue alerts -> #136952
  - Also because our job age alert is actually wrongly defined -> #135008
- Why did we struggle to understand the underlying problem?
  - Because we did not see any error or warning in the logs about the scheduler being unable to schedule any jobs for longer spans of time
- Why weren't we aware that we have so many more workers?
  - Maybe not everybody was properly following the plan in the backlog regarding the setup of new workers? Yes, maybe, but we cannot think of anything to improve -> #135362
  - Likely, having historic data in grafana on the number of workers (offline, online, idle) would have helped to find the "last good" reference -> #136958
- Why did the scheduler not complain that it was blocked for too long or was too slow to finish within the timeout?
  - There were no warnings because of the very long inactivity timeout of 10 minutes for HTTP requests from the scheduler to the websocket server -> Look into shorter timeouts and a good error message -> #136961
  - Even if there were an error in the log, it would be in the journal and we would not be notified about it -> Look up the ticket about logwarn for OSD -> #97544 #97247 #57239
- Why have we forgotten about #110833 "Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances"?
  - Because it's in "future" -> We have already planned to split weekly+retro and reserve more time for the future outlook, in particular the saga level, see #135023
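The "way too few jobs, just a handful" condition from the first question can be illustrated with a small check. This is only a sketch in Python, not the alert that was actually defined in #135380: the names `QueueSnapshot` and `too_few_running` as well as both threshold values are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    """Hypothetical snapshot of job counts; field names are illustrative."""
    scheduled: int  # jobs waiting to be assigned to a worker
    running: int    # jobs currently executing


def too_few_running(snapshot: QueueSnapshot,
                    min_running: int = 5,
                    min_scheduled: int = 100) -> bool:
    """Return True when many jobs are waiting but only a handful run.

    Both thresholds are assumed values, not taken from the real alert.
    """
    return (snapshot.scheduled >= min_scheduled
            and snapshot.running < min_running)


# A situation resembling the incident: >3k scheduled, one job running.
print(too_few_running(QueueSnapshot(scheduled=3000, running=1)))    # True
# A healthy system with a long queue that is being worked on.
print(too_few_running(QueueSnapshot(scheduled=3000, running=200)))  # False
# An almost empty queue should not trigger the alert.
print(too_few_running(QueueSnapshot(scheduled=10, running=0)))      # False
```

Requiring a minimum queue size next to the running-jobs limit avoids false alarms during quiet periods when there is simply nothing to schedule.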
Acceptance criteria
- AC1: A Five-Whys analysis has been conducted and results documented
- AC2: Improvements are planned
Suggestions
- Bring up in retro
- Conduct "Five-Whys" analysis for the topic
- Identify follow-up tasks in tickets
- Organize a call to conduct the 5 whys