Limit execution time of hook scripts run within Minion
Today we've seen that a few finalize_job_results jobs blocked the whole Minion job queue for quite a while (until manually aborted) because the command grep -qPzo '(?s)Gru job failed.*connection error.*Inactivity timeout' from the hook script openqa-label-known-issues kept the Minion workers busy.
- AC1: Hook scripts are aborted after a configurable timeout.
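One way AC1 could be approached is to wrap the hook invocation in timeout(1). The following is only a minimal sketch under that assumption, not the implementation used in openQA; the function name run_hook and the variables HOOK_TIMEOUT and HOOK_KILL_AFTER are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper: run a hook command and abort it after a configurable limit.
# HOOK_TIMEOUT and HOOK_KILL_AFTER are illustrative names, not existing openQA settings.
run_hook() {
    # timeout(1) sends SIGTERM once HOOK_TIMEOUT seconds have passed and SIGKILL
    # if the command is still alive HOOK_KILL_AFTER seconds later.
    # It exits with status 124 when the command was stopped by the SIGTERM.
    timeout -k "${HOOK_KILL_AFTER:-5}" "${HOOK_TIMEOUT:-300}" "$@"
}
```

A caller could then check for exit status 124 to tell an aborted hook apart from one that failed on its own, and log it accordingly.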
- I'm not sure what is special about the openQA jobs that take so long to investigate, but e.g. https://openqa.suse.de/tests/5527320 is one of them.
openqa-label-known-issues can be made more efficient as well. Note that the mentioned grep command actually caused considerable CPU usage, so the script wasn't just waiting for something.
I'll also add an upstream feature to Minion to help with this: a fast lane for high priority jobs, since it's a pretty common issue to have a bunch of very slow jobs clogging the queue. Then we can use --jobs 12 --spare 4, and low priority jobs will only use the first 12 slots, while 4 would always be reserved for high priority jobs.
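The slot arithmetic behind --jobs 12 --spare 4 can be illustrated with a toy model. This is a sketch of the idea only, not Minion's code; the function can_start and the threshold variable SPARE_MIN_PRIORITY are assumptions made for the example.

```shell
#!/bin/sh
# Toy model of the fast lane, not Minion's implementation.
JOBS=12               # regular slots, usable by jobs of any priority
SPARE=4               # extra slots reserved for high priority jobs
SPARE_MIN_PRIORITY=1  # assumed threshold for "high priority" in this sketch

# can_start RUNNING PRIORITY
# Succeeds if a job with the given priority may start while RUNNING jobs are active.
can_start() {
    running=$1
    priority=$2
    # Any job may take a free regular slot.
    if [ "$running" -lt "$JOBS" ]; then
        return 0
    fi
    # Once the regular slots are full, only high priority jobs may use the spare ones.
    [ "$priority" -ge "$SPARE_MIN_PRIORITY" ] && [ "$running" -lt $((JOBS + SPARE)) ]
}
```

In this model, twelve slow low priority jobs can fill the regular slots, but a high priority job still starts as long as fewer than 16 jobs are running in total.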
The upstream feature for Minion has been added now and should reach Factory soon. I also wrote a blog post about it: https://dev.to/kraih/high-priority-fast-lane-for-the-minion-job-queue-4711
Note that for the cleanup we already limit the number of concurrently running jobs. The actual problem was the hook script execution (for automatic job investigation).
I'll check how to make use of the new Minion feature. Maybe we need to lower/increase the priority of some job types.