action #96684

open

coordination #103944: [saga][epic] Scale up: More robust handling of diverse infrastructure with varying performance

coordination #98463: [epic] Avoid too slow asset downloads leading to jobs exceeding the timeout with or run into auto_review:"(timeout: setup exceeded MAX_SETUP_TIME|Cache service queue already full)":retry

Abort asset download via the cache service when related job runs into a timeout (or is otherwise cancelled) size:M

Added by mkittler about 2 years ago. Updated 13 days ago.

Status: Workable
Priority: Low
Assignee: -
Category: Feature requests
Target version:
Start date: 2021-08-09
Due date:
% Done: 0%
Estimated time:
Difficulty: hard

Description

Motivation

If jobs run into MAX_SETUP_TIME (as seen in #96557) or are otherwise cancelled, the Minion jobs for their asset downloads are not cancelled. As a result, a worker that is overloaded with too many asset download tasks is unlikely to recover on its own. Stopping inactive or even active Minion jobs for asset downloads when the related openQA jobs have been cancelled would help with that situation.

Acceptance criteria

  • AC1: Inactive (or even active) Minion jobs are cancelled if the related openQA job is cancelled.
  • AC2: A Minion job can be responsible for multiple openQA jobs (if they share the same assets). This must keep working, so the cancellation (AC1) should only happen if no other openQA job still requires the Minion job.
  • AC3: No partial files (or even stale data in the database) are left behind.
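
One way to read AC1 and AC2 together is as reference counting: a download may only be cancelled once the last openQA job that requested it is gone. The following is a minimal sketch of that idea in Python with SQLite, not the cache service's actual schema or code. The table name `download_requests` and the functions `open_tracker`, `register` and `cancel` are assumptions made up for illustration.

```python
import sqlite3


def open_tracker(path=":memory:"):
    """Create the (hypothetical) table mapping downloads to openQA jobs."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS download_requests ("
        " minion_job_id INTEGER NOT NULL,"
        " openqa_job_id INTEGER NOT NULL,"
        " PRIMARY KEY (minion_job_id, openqa_job_id))"
    )
    return db


def register(db, minion_job_id, openqa_job_id):
    """Record that an openQA job depends on a (possibly shared) download."""
    with db:
        db.execute(
            "INSERT OR IGNORE INTO download_requests VALUES (?, ?)",
            (minion_job_id, openqa_job_id),
        )


def cancel(db, openqa_job_id):
    """Drop the job's requests and return downloads nobody needs any more."""
    orphaned = []
    with db:  # commit on success, roll back if anything raises
        rows = db.execute(
            "SELECT minion_job_id FROM download_requests"
            " WHERE openqa_job_id = ?",
            (openqa_job_id,),
        ).fetchall()
        db.execute(
            "DELETE FROM download_requests WHERE openqa_job_id = ?",
            (openqa_job_id,),
        )
        for (minion_job_id,) in rows:
            remaining = db.execute(
                "SELECT COUNT(*) FROM download_requests"
                " WHERE minion_job_id = ?",
                (minion_job_id,),
            ).fetchone()[0]
            if remaining == 0:
                # No other openQA job requires this download (AC2).
                orphaned.append(minion_job_id)
    return sorted(orphaned)
```

In this sketch the caller would stop only the Minion jobs returned by `cancel`; a download that is still referenced by another openQA job is left alone.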

Suggestions

  • Cache service downloads are deduplicated, so make sure not to cancel downloads that are still required by other openQA jobs on the same worker (might require a new SQLite table to keep track of cancelled jobs)
  • Increase or remove the cache service backlog limit once download cancellation is implemented
  • Use the Minion feature for terminating active jobs properly
  • Ensure terminating jobs shut down gracefully
  • Make sure not to corrupt the SQLite database by terminating the job too aggressively
  • Maybe use USR1/USR2 signals to run custom termination code in the job

Related issues 2 (0 open, 2 closed)

Related to openQA Project - action #96557: jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry | Resolved | mkittler | 2021-08-04 | 2021-08-19

Copied to openQA Project - action #128267: Restarting jobs (e.g. due to full cache queue) can lead to weird behavior for certain job dependencies (was: Ensure that the incomplete jobs with "cache service full" are properly restarted (take 2)) size:M | Resolved | mkittler | 2023-05-10

Actions #1

Updated by mkittler about 2 years ago

  • Related to action #96557: jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry added
Actions #2

Updated by tinita about 2 years ago

  • Target version set to future
Actions #3

Updated by mkittler about 2 years ago

  • Parent task set to #98463
Actions #4

Updated by okurz 7 months ago

  • Category set to Feature requests
  • Priority changed from Normal to High
  • Target version changed from future to Ready

I don't know if this really fixes the problem mentioned in the epic. This should be revisited.

Actions #5

Updated by mkittler 7 months ago

  • Description updated (diff)
  • Difficulty set to hard
Actions #6

Updated by okurz 7 months ago

  • Priority changed from High to Low
  • Target version changed from Ready to future

We are looking into #125276 first

Actions #7

Updated by mkittler 5 months ago

  • Subject changed from Abort asset download via the cache service when related job runs into a timeout (or is otherwise cancelled) to Abort asset download via the cache service when related job runs into a timeout (or is otherwise cancelled) size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #8

Updated by mkittler 5 months ago

  • Target version changed from future to Ready
Actions #9

Updated by okurz 5 months ago

  • Copied to action #128267: Restarting jobs (e.g. due to full cache queue) can lead to weird behavior for certain job dependencies (was: Ensure that the incomplete jobs with "cache service full" are properly restarted (take 2)) size:M added
Actions #10

Updated by kraih about 1 month ago

  • Assignee set to kraih
Actions #11

Updated by kraih 21 days ago

  • Assignee deleted (kraih)
Actions #12

Updated by okurz 13 days ago

  • Target version changed from Ready to Tools - Next