action #96684: Abort asset download via the cache service when related job runs into a timeout (or is otherwise cancelled) size:M - openQA Project (public) - openSUSE Project Management Tool

closed

coordination #103944: [saga][epic] Scale up: More robust handling of diverse infrastructure with varying performance

coordination #98463: [epic] Avoid too slow asset downloads leading to jobs exceeding the timeout with or run into auto_review:"(timeout: setup exceeded MAX_SETUP_TIME|Cache service queue already full)":retry

Abort asset download via the cache service when related job runs into a timeout (or is otherwise cancelled) size:M

Added by mkittler almost 4 years ago. Updated over 1 year ago.

Status: Rejected
Priority: Low
Assignee: mkittler
Category: Feature requests
Target version: Ready
Start date: 2021-08-09
Due date:
% Done:
Estimated time:

Description

Motivation

If jobs run into MAX_SETUP_TIME (as seen in #96557) or are otherwise cancelled, the Minion jobs for their asset downloads are not cancelled. As a result, a worker that is overloaded with too many asset download tasks is unlikely to recover on its own. Stopping inactive, or even active, Minion jobs for asset downloads once the related openQA jobs have been cancelled would help in that situation.

Acceptance criteria

AC1: Inactive (or even active) Minion jobs are cancelled when the related openQA job is cancelled.
AC2: A Minion job can be responsible for multiple openQA jobs (if they share the same assets). This must keep working, so the cancellation from AC1 only happens once no other openQA job requires the Minion job.
AC3: No partial files (or stale data in the database) are left behind.
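
The interplay of AC1 and AC2 amounts to reference counting: a deduplicated download may only be cancelled once the last openQA job that needs it is gone. The following is a minimal in-memory sketch of that idea in Python. It is not the openQA cache service code (which is written in Perl and keeps its state in SQLite), and `DownloadRegistry`, `register` and `cancel_job` are hypothetical names chosen for illustration.

```python
from collections import defaultdict


class DownloadRegistry:
    """Hypothetical tracker of which openQA jobs still need each download."""

    def __init__(self):
        # download id -> set of openQA job ids that still need it
        self._waiters = defaultdict(set)

    def register(self, download_id, job_id):
        """Record that job_id depends on the (deduplicated) download."""
        self._waiters[download_id].add(job_id)

    def cancel_job(self, job_id):
        """Drop job_id everywhere and return the downloads nobody needs anymore."""
        orphaned = []
        for download_id in list(self._waiters):
            waiters = self._waiters[download_id]
            waiters.discard(job_id)
            if not waiters:
                # AC2: only now is it safe to cancel the download itself
                del self._waiters[download_id]
                orphaned.append(download_id)
        return sorted(orphaned)


registry = DownloadRegistry()
registry.register("iso-A", 1)
registry.register("iso-A", 2)
registry.register("hdd-B", 1)
print(registry.cancel_job(1))  # ['hdd-B'] (iso-A is still needed by job 2)
print(registry.cancel_job(2))  # ['iso-A']
```

A persistent variant would store the same mapping in a database table so that it survives restarts of the cache service.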

Suggestions

Cache service downloads are deduplicated, so make sure not to cancel downloads that other openQA jobs on the same worker still require (this might require a new SQLite table to keep track of cancelled jobs)
Increase or remove the cache service backlog limit once download cancellation is implemented
Use Minion feature to terminate active jobs properly
Ensure terminating jobs shut down gracefully
Make sure not to corrupt the SQLite database by terminating the job too aggressively
Maybe use USR1/USR2 signals to run custom termination code in the job
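
The last three suggestions can be illustrated with a small Python sketch: the signal handler only sets a flag, the download loop checks that flag between chunks, and an aborted download removes its partial file before any row is written to SQLite. This is an assumption-laden illustration rather than the Minion or openQA implementation. The names `request_stop`, `install_handler` and `download`, the `.part` suffix and the `assets` table are all hypothetical.

```python
import os
import signal
import sqlite3

stop_requested = False


def request_stop(signum, frame):
    """Signal handler: only set a flag, the actual cleanup happens in download()."""
    global stop_requested
    stop_requested = True


def install_handler():
    """Call once at job start (main thread, Unix only) to react to SIGUSR1."""
    signal.signal(signal.SIGUSR1, request_stop)


def download(chunks, target, db):
    """Write chunks to target; return True on success, False if aborted."""
    partial = target + ".part"
    aborted = False
    with open(partial, "wb") as out:
        for chunk in chunks:
            if stop_requested:
                aborted = True
                break
            out.write(chunk)
    if aborted:
        os.remove(partial)  # no partial file is left behind
        return False
    os.replace(partial, target)  # atomic rename once the file is complete
    with db:  # commits on success, rolls back if an exception is raised
        db.execute("INSERT INTO assets (path) VALUES (?)", (target,))
    return True
```

Because the database is only touched after the file is complete, an abort in the middle of a transfer cannot leave a stale row behind, and the flag-based handler avoids interrupting SQLite in the middle of a write.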

Related issues 3 (0 open — 3 closed)

Updated by mkittler almost 4 years ago

Updated by tinita almost 4 years ago

Updated by mkittler over 3 years ago

Updated by okurz over 2 years ago

Updated by mkittler about 2 years ago

Updated by okurz about 2 years ago

Updated by mkittler about 2 years ago

Updated by mkittler about 2 years ago

Updated by okurz about 2 years ago

Updated by kraih almost 2 years ago

Updated by kraih over 1 year ago

Updated by okurz over 1 year ago

Updated by okurz over 1 year ago

Updated by okurz over 1 year ago

Updated by mkittler over 1 year ago

Updated by mkittler over 1 year ago

Updated by mkittler over 1 year ago

Updated by okurz over 1 year ago

Updated by mkittler over 1 year ago · Edited