action #135239
Updated by tinita about 1 year ago
## Observation

* [User reports on September 4](https://suse.slack.com/archives/C02CANHLANP/p1693798894476509) about only a few jobs running

## Questions

1. Why was nobody in the team aware of the issue before user reports?
   * Because we don't have any alerts defined for "no jobs running" -> DONE TODO Define an alert (first issue was only one job running) for "way too few jobs, just a handful" -> https://progress.opensuse.org/issues/135380#note-4
   * We have alerts for queue size and job age but we did not hit the threshold yet -> TODO Consider lowering the threshold for job age and job queue alerts
   * Also because our job age alert is actually wrongly defined -> #135008
2. Why did we struggle to understand the underlying problem?
   * Because we did not see any error or warning in the logs about the scheduler being unable to schedule any jobs for longer spans of time
3. Why weren't we aware that we have so many more workers?
   * Maybe not everybody was properly following the plan in the backlog regarding setup of new workers? Yes, maybe, but there is nothing we can think of to improve -> #135362
   * TODO Likely having historic data in grafana regarding the number of workers (offline, online, idle) would have helped to find the "last good" reference
4. Why did the scheduler not complain that it was blocked for too long or is too slow to finish within timeout?
   * There were no warnings because of the very long inactivity timeout of 10 minutes for HTTP requests from the scheduler to the websocket server -> TODO look into shorter timeouts and a good error message
   * Even if there were an error in the log, it would be in the journal and we would not be notified about it -> TODO look up the ticket about logwarn for OSD
5. Why have we forgotten about #110833 "Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances"?
   * Because it's in "future" -> We have already planned to split weekly+retro and reserve more time for the future outlook, in particular the saga level, see #135023

## Acceptance criteria

* **AC1:** A [Five-Whys](https://en.wikipedia.org/wiki/Five_whys) analysis has been conducted and results documented
* **AC2:** Improvements are planned

## Suggestions

* Bring up in retro
* Conduct "Five-Whys" analysis for the topic
* Identify follow-up tasks in tickets
* Organize a call to conduct the 5 whys
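
## Sketch: "way too few jobs" alert condition

To make the alert idea from question 1 more concrete, here is a minimal sketch of the condition such an alert could evaluate. It is only an illustration: `SchedulerSnapshot`, `too_few_jobs_running` and all threshold values are hypothetical and do not come from openQA or from the existing grafana alerts. The assumption is that the alert should fire only when jobs are queued and workers are idle, so that a quiet period with nothing scheduled does not trigger it.

```python
from dataclasses import dataclass


@dataclass
class SchedulerSnapshot:
    """Hypothetical point-in-time counts, e.g. as they could be read from monitoring."""
    running_jobs: int
    scheduled_jobs: int
    idle_workers: int


def too_few_jobs_running(
    snapshot: SchedulerSnapshot,
    min_running: int = 10,   # assumed threshold: "just a handful" means fewer than this
    min_backlog: int = 50,   # assumed threshold: only alert when enough jobs are waiting
    min_idle: int = 10,      # assumed threshold: only alert when workers could take jobs
) -> bool:
    """Return True when work is queued and workers are idle, yet hardly any job runs."""
    has_backlog = snapshot.scheduled_jobs >= min_backlog
    has_capacity = snapshot.idle_workers >= min_idle
    return has_backlog and has_capacity and snapshot.running_jobs < min_running


if __name__ == "__main__":
    # Situation similar to the observation: large queue, many idle workers, one running job.
    print(too_few_jobs_running(SchedulerSnapshot(running_jobs=1, scheduled_jobs=5000, idle_workers=300)))
```

Combining the three counts instead of alerting on `running_jobs` alone is a design choice of this sketch: it is meant to avoid false alarms when the queue is simply empty.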