action #128276: Handle workers with busy cache service gracefully by a two-level wait size:M - openQA Project (public) - openSUSE Project Management Tool

Actions

action #128276

closed

coordination #103944: [saga][epic] Scale up: More robust handling of diverse infrastructure with varying performance

coordination #98463: [epic] Avoid too slow asset downloads leading to jobs exceeding the timeout with or run into auto_review:"(timeout: setup exceeded MAX_SETUP_TIME|Cache service queue already full)":retry

Handle workers with busy cache service gracefully by a two-level wait size:M

Added by okurz almost 2 years ago. Updated about 1 year ago.

Status:

Resolved

Priority:

Low

Assignee:

mkittler

Category:

Feature requests

Target version:

Ready

Start date:

2023-04-25

Due date:

% Done:

Estimated time:

Description

Motivation¶

As observed in #128267 "cache service full" jobs are still too visible to users and we should reconsider propagating the error that easily to the job level. The idea that mkittler and me had is to ensure that scheduled jobs are only assigned to workers where "cache queue size < cache_queue_limit_concurrent" (timeout is the timeout of some days when scheduled jobs get cancelled implicitly). AFAIK this is already present in the current implementation but should be cross-checked. If the condition is fulfilled and a job is assigned to a worker it can still happen that the cache queue becomes "too long" due to other jobs or the original job adding to the cache queue. This is why in this case jobs assigned to a worker should wait for other download jobs to complete before continuing. Here the timeout is MAX_SETUP_TIME. There should be the second limit "cache_queue_limit_max" (is that the current limit of 5 causing "Cache service queue already full"?) when exceeded incomplete the job as we do but with a more conservative limit, e.g. 50.

Acceptance criteria¶

AC1: During normal operation on o3 including rebooting o3 workers and "busy days" no "cache service full" incompletes happen

Acceptance tests¶

AT1-1: Given no jobs running on a worker And clean cache When a job is scheduled Then the job is immediately picked up by the worker, cache service requests are executed, job starts
AT1-2: Given jobs running on a worker with assigned cache service requests exceeding "cache_queue_limit_concurrent" When a job is scheduled Then the job is not immediately picked up by the worker
AT1-3: Given jobs running on a worker with assigned cache service requests exceeding "cache_queue_limit_concurrent-1" When jobs are still forcefully assigned to the worker (simulating unfortunate timing aka. "race condition") Then the job assigned cache service request waits for a free slot
AT1-4: Given jobs running on a worker with assigned cache service requests with cache service requests When exceeding "cache_queue_limit_max" Then jobs incomplete as in before with "Cache service queue already full" (likely rephrased a bit)

Suggestions¶

Review existing implementation of the check that is done before assigning jobs to workers
Ensure defaults for the both limit are far enough apart, e.g. "cache_queue_limit_concurrent" = 3, "cache_queue_limit_max" = 50
Execute ATs to see where implementation is missing

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public)

Tags

Custom queries

action #128276

Handle workers with busy cache service gracefully by a two-level wait size:M

Motivation¶

Acceptance criteria¶

Acceptance tests¶

Suggestions¶

Updated by okurz almost 2 years ago

Updated by okurz almost 2 years ago

Updated by livdywan almost 2 years ago

Updated by okurz over 1 year ago

Updated by okurz over 1 year ago

Updated by okurz over 1 year ago

Updated by okurz over 1 year ago

Updated by mkittler about 1 year ago

Updated by mkittler about 1 year ago

Updated by mkittler about 1 year ago

Updated by mkittler about 1 year ago · Edited

Updated by mkittler about 1 year ago · Edited

Updated by mkittler about 1 year ago

Updated by mkittler about 1 year ago

Updated by mkittler about 1 year ago