action #89224

closed

Limit execution time of hook scripts run within Minion

Added by mkittler almost 4 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2021-02-26
Due date: 2021-03-16
% Done: 0%
Estimated time:

Description

motivation

Today we saw a few finalize_job_results jobs block the whole Minion job queue for quite a while (until they were manually aborted). The command grep -qPzo '(?s)Gru job failed.*connection error.*Inactivity timeout', run by the hook script openqa-label-known-issues, kept the Minion workers busy.

acceptance criteria

  • AC1: Hook scripts are aborted after a configurable timeout.
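As an illustration of AC1, here is a minimal sketch of aborting a hook command after a configurable timeout. It is not the change that was merged for this ticket; the function name run_hook and its parameters are hypothetical, and it assumes a POSIX system. The hook is started in its own process group so that child processes (such as a long-running grep) are killed together with the script.

```python
import os
import signal
import subprocess
import sys


def run_hook(cmd, timeout_s):
    """Run cmd and return (exit_code, timed_out).

    Hypothetical helper for illustration only (POSIX-only).
    The command gets its own session/process group so that the
    whole group can be killed once the timeout expires.
    """
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        # With start_new_session=True the process group ID equals the PID.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the killed process
        return None, True


if __name__ == "__main__":
    # A stand-in for a hook script that runs for too long.
    slow_hook = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_hook(slow_hook, timeout_s=0.5))
```

Killing the whole process group rather than only the direct child matters here because the CPU time was spent in a command spawned by the hook script, not in the script itself.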

further notes

  • I'm not sure what is special about the openQA jobs that take so long to investigate, but https://openqa.suse.de/tests/5527320 is one of them.
  • Maybe openqa-label-known-issues can be made more efficient as well. Note that the mentioned grep command actually caused considerable CPU usage, so the script wasn't just waiting for something.
Actions #1

Updated by okurz almost 4 years ago

  • Target version set to Ready
Actions #2

Updated by mkittler almost 4 years ago

  • Status changed from New to Workable
  • Assignee set to mkittler
Actions #3

Updated by kraih almost 4 years ago

I'll also add an upstream feature to Minion to help with this: a fast lane for high-priority jobs, since a bunch of very slow jobs clogging the queue is a pretty common issue. Then we can use --jobs 12 --spare 4, so low-priority jobs only use the first 12 slots while 4 slots are always reserved for high-priority jobs.
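The slot reservation described above can be modelled roughly as follows. This is a simplified sketch of the idea, not Minion's actual implementation; the function can_start and the min_spare_priority threshold are assumptions made for illustration.

```python
def can_start(running, priority, jobs=12, spare=4, min_spare_priority=1):
    """Decide whether a worker may start another job.

    Simplified model (assumption, not Minion's code):
    - up to `jobs` slots are open to jobs of any priority;
    - `spare` extra slots are only used for jobs whose priority
      is at least `min_spare_priority` (hypothetical threshold).
    """
    if running < jobs:
        return True
    return running < jobs + spare and priority >= min_spare_priority


# 12 slow low-priority jobs are running: another low-priority job waits,
# while a high-priority job still gets one of the 4 spare slots.
print(can_start(running=12, priority=0))  # False
print(can_start(running=12, priority=5))  # True
```

In this model, slow low-priority jobs can never occupy more than the regular slots, so high-priority jobs keep making progress even when the regular slots are all busy.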

Actions #4

Updated by openqa_review almost 4 years ago

  • Due date set to 2021-03-16

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by mkittler almost 4 years ago

  • Status changed from Workable to In Progress
Actions #6

Updated by mkittler almost 4 years ago

  • Status changed from In Progress to Feedback

PR has been merged

Actions #7

Updated by kraih almost 4 years ago

Upstream feature for Minion has been added now and should reach Factory soon. I also wrote a blog post about it. https://dev.to/kraih/high-priority-fast-lane-for-the-minion-job-queue-4711

Actions #8

Updated by okurz almost 4 years ago

awesome blog post!

Actions #9

Updated by mkittler almost 4 years ago

Note that for the cleanup we already limit the number of concurrently running jobs. The actual problem was the hook script execution (for automatic job investigation).

I'll check how to make use of the new Minion feature. Maybe we need to lower/increase the priority of some job types.

Actions #11

Updated by mkittler almost 4 years ago

The new Minion version is in Factory and in our repos for Leap. I'll keep the ticket in feedback to wait until it is deployed in production.

Actions #12

Updated by mkittler almost 4 years ago

  • Status changed from Feedback to Resolved

The Minion dashboard on OSD now shows 2 spare workers. Together with the other changes, this should make long-running hook scripts harmless.
