action #89224

Limit execution time of hook scripts run within Minion

Added by mkittler 5 months ago. Updated 4 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2021-02-26
Due date: 2021-03-16
% Done: 0%
Estimated time:
Difficulty:

Description

motivation

Today a few finalize_job_results jobs blocked the whole Minion job queue for quite a while (until they were manually aborted) because the command grep -qPzo '(?s)Gru job failed.*connection error.*Inactivity timeout', run by the hook script openqa-label-known-issues, kept the Minion workers busy.

acceptance criteria

  • AC1: Hook scripts are aborted after a configurable timeout.
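AC1 could be approached roughly as in the following sketch. It is not the openQA implementation (openQA is written in Perl); the function name run_hook and the parameters timeout_s and grace_s are hypothetical. It assumes a POSIX system and starts the hook in its own process group so that child processes such as a long-running grep are terminated together with the script.

```python
import os
import signal
import subprocess


def run_hook(cmd, timeout_s, grace_s=1.0):
    """Run a hook command and abort it after timeout_s seconds.

    Returns (exit_code, timed_out). Hypothetical helper, POSIX only.
    """
    # start_new_session=True makes the child a process group leader,
    # so its group ID equals its PID and the whole group can be signalled.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        # Ask the script and all of its children to terminate.
        os.killpg(proc.pid, signal.SIGTERM)
        try:
            proc.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            # Still running after the grace period: kill the group.
            os.killpg(proc.pid, signal.SIGKILL)
            proc.wait()
        return proc.returncode, True
```

A caller would read timeout_s from configuration and treat timed_out as a failed hook run rather than letting the worker slot stay occupied.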

further notes

  • I'm not sure what is special about the openQA jobs that take so long to investigate, but https://openqa.suse.de/tests/5527320 is one of them.
  • Maybe openqa-label-known-issues can be made more efficient as well. Note that the mentioned grep command caused considerable CPU usage, so the script wasn't just waiting for something.

History

#1 Updated by okurz 5 months ago

  • Target version set to Ready

#2 Updated by mkittler 5 months ago

  • Status changed from New to Workable
  • Assignee set to mkittler

#3 Updated by kraih 5 months ago

I'll also add an upstream feature to Minion to help with this: a fast lane for high-priority jobs, since a bunch of very slow jobs clogging the queue is a pretty common issue. Then we can use --jobs 12 --spare 4, and low-priority jobs will only use the first 12 slots, while 4 will always be reserved for high-priority jobs.
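The slot arithmetic described here can be illustrated with a simplified model. This is not Minion's code; the function can_start and the spare_min_priority threshold with its default of 1 are assumptions of this sketch, used only to show how 12 regular and 4 spare slots would be shared.

```python
def can_start(priority, active, jobs=12, spare=4, spare_min_priority=1):
    """Return True if a job of the given priority may start now.

    active is the number of jobs the worker is currently running.
    Simplified model: jobs at or above spare_min_priority (an assumed
    threshold) may also use the spare slots, other jobs may not.
    """
    limit = jobs + spare if priority >= spare_min_priority else jobs
    return active < limit


# With all 12 regular slots busy, only high-priority jobs still fit.
print(can_start(priority=0, active=12))  # False
print(can_start(priority=5, active=12))  # True
print(can_start(priority=5, active=16))  # False
```

Under this model, slow low-priority jobs can fill at most the regular slots, so a high-priority job never has to wait behind them as long as a spare slot is free.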

#4 Updated by openqa_review 5 months ago

  • Due date set to 2021-03-16

Setting due date based on mean cycle time of SUSE QE Tools

#5 Updated by mkittler 5 months ago

  • Status changed from Workable to In Progress

#6 Updated by mkittler 5 months ago

  • Status changed from In Progress to Feedback

PR has been merged

#7 Updated by kraih 5 months ago

Upstream feature for Minion has been added now and should reach Factory soon. I also wrote a blog post about it. https://dev.to/kraih/high-priority-fast-lane-for-the-minion-job-queue-4711

#8 Updated by okurz 5 months ago

awesome blog post!

#9 Updated by mkittler 5 months ago

Note that for the cleanup we already limit the number of concurrently running jobs. The actual problem was the hook script execution (for automatic job investigation).

I'll check how to make use of the new Minion feature. Maybe we need to lower/increase the priority of some job types.

#11 Updated by mkittler 5 months ago

The new Minion version is in Factory and in our repos for Leap. I'll keep the ticket in feedback to wait until it is deployed in production.

#12 Updated by mkittler 4 months ago

  • Status changed from Feedback to Resolved

The Minion dashboard on OSD now shows 2 spare workers. Together with the other changes, this should make long-running hook scripts harmless.
