action #89224
closed
Limit execution time of hook scripts run within Minion
Description
motivation
Today we've seen that a few finalize_job_results jobs blocked the whole Minion job queue for quite a while (until manually aborted) because the command grep -qPzo '(?s)Gru job failed.*connection error.*Inactivity timeout' from the hook script openqa-label-known-issues kept the Minion workers busy.
acceptance criteria
- AC1: Hook scripts are aborted after a configurable timeout.
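One way AC1 could look in practice, as a minimal sketch only: run the hook as a child process and give up once a configurable number of seconds has passed. The function name run_hook and the parameter timeout_s are made up for illustration; this is not the openQA implementation (which is written in Perl) and says nothing about how the merged PR solved it.

```python
import subprocess
import sys


def run_hook(command, timeout_s):
    """Run a hook command and abort it after timeout_s seconds.

    Hypothetical helper for illustration, not openQA code.
    Returns (exit_code, timed_out); exit_code is None on timeout.
    """
    try:
        result = subprocess.run(command, timeout=timeout_s)
        return result.returncode, False
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the direct child and waits for it
        # before re-raising, so no zombie process is left behind.
        return None, True


if __name__ == "__main__":
    # Stand-in for a slow hook script: sleeps longer than the timeout.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_hook(slow, timeout_s=0.5))
```

Note that this only kills the direct child. A hook script that spawns its own long-running children (like the grep above) would likely need to be started in its own process group so the whole group can be terminated.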
further notes
- I'm not sure what is special about the openQA jobs that take so long to investigate, but https://openqa.suse.de/tests/5527320 is one of them.
- Maybe openqa-label-known-issues can be made more efficient as well. Note that the mentioned grep command actually caused considerable CPU usage, so the script wasn't just waiting for something.
Updated by mkittler almost 4 years ago
- Status changed from New to Workable
- Assignee set to mkittler
Updated by kraih almost 4 years ago
I'll also add an upstream feature to Minion to help with this, a fast lane for high priority jobs, since it's a pretty common issue to have a bunch of very slow jobs clogging the queue. Then we can use --jobs 12 --spare 4, and low priority jobs will only use the first 12 slots, while 4 would always be reserved for high priority jobs.
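The slot rule described here can be modelled roughly as follows. This is a simplified sketch of the idea, not Minion's actual code: the function name may_start is hypothetical, and the threshold min_priority=1 for what counts as "high priority" is an assumption made for the example.

```python
def may_start(running, priority, jobs=12, spare=4, min_priority=1):
    """Decide whether a worker with `running` active jobs may start
    another job of the given priority.

    Simplified model, not Minion code; min_priority is an assumed
    threshold for "high priority".
    """
    if running < jobs:
        # Regular slots are open to jobs of any priority.
        return True
    if running < jobs + spare:
        # Only spare slots are left: reserved for high priority jobs.
        return priority >= min_priority
    # Every slot, regular and spare, is taken.
    return False


if __name__ == "__main__":
    print(may_start(running=12, priority=0))  # low priority, regular slots full
    print(may_start(running=12, priority=5))  # high priority may use a spare slot
```

With 12 slow low priority jobs running, further low priority jobs have to wait, but a high priority job can still be picked up until all 16 slots are busy.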
Updated by openqa_review almost 4 years ago
- Due date set to 2021-03-16
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler almost 4 years ago
- Status changed from Workable to In Progress
Updated by mkittler almost 4 years ago
- Status changed from In Progress to Feedback
PR has been merged
Updated by kraih almost 4 years ago
Upstream feature for Minion has been added now and should reach Factory soon. I also wrote a blog post about it. https://dev.to/kraih/high-priority-fast-lane-for-the-minion-job-queue-4711
Updated by mkittler almost 4 years ago
Note that for the cleanup we already limit the number of concurrently running jobs. The actual problem was the hook script execution (for automatic job investigation).
I'll check how to make use of the new Minion feature. Maybe we need to lower/increase the priority of some job types.
Updated by mkittler almost 4 years ago
The new Minion version is in Factory and in our repos for Leap. I'll keep the ticket in feedback to wait until it is deployed in production.
Updated by mkittler almost 4 years ago
- Status changed from Feedback to Resolved
The Minion dashboard on OSD now shows 2 spare workers. Together with the other changes, this should make long-running hook scripts harmless.