action #128276

closed

coordination #103944: [saga][epic] Scale up: More robust handling of diverse infrastructure with varying performance

coordination #98463: [epic] Avoid too slow asset downloads leading to jobs exceeding the timeout with or run into auto_review:"(timeout: setup exceeded MAX_SETUP_TIME|Cache service queue already full)":retry

Handle workers with busy cache service gracefully by a two-level wait size:M

Added by okurz about 1 year ago. Updated 3 months ago.

Status: Resolved
Priority: Low
Assignee:
Category: Feature requests
Target version:
Start date: 2023-04-25
Due date:
% Done: 0%
Estimated time:

Description

Motivation

As observed in #128267, "cache service full" jobs are still too visible to users, and we should reconsider propagating the error that easily to the job level. The idea that mkittler and I had is to ensure that scheduled jobs are only assigned to workers where "cache queue size < cache_queue_limit_concurrent". The timeout at this level is the one of some days after which scheduled jobs get cancelled implicitly. AFAIK this is already present in the current implementation but should be cross-checked.

If the condition is fulfilled and a job is assigned to a worker, it can still happen that the cache queue becomes "too long" due to other jobs or the original job adding to the cache queue. This is why, in this case, jobs assigned to a worker should wait for other download jobs to complete before continuing. Here the timeout is MAX_SETUP_TIME. There should be a second limit, "cache_queue_limit_max" (is that the current limit of 5 causing "Cache service queue already full"?). When it is exceeded, we incomplete the job as we do now, but with a more conservative limit, e.g. 50.
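The two-level idea can be sketched as a small decision function. This is only an illustration in Python, not the openQA implementation: the function and enum names are hypothetical, the defaults of 3 and 50 are the example values from this ticket, and the exact boundary handling (strictly below the concurrent limit to proceed, strictly above the max limit to incomplete) is an assumption derived from the wording above.

```python
from enum import Enum


class CacheDecision(Enum):
    PROCEED = "proceed"        # queue short enough: assign job / start downloads
    WAIT = "wait"              # queue busy: keep the job and wait for a slot
    INCOMPLETE = "incomplete"  # queue far too long: give up on this worker


def decide(queue_size: int,
           limit_concurrent: int = 3,   # example default from this ticket
           limit_max: int = 50) -> CacheDecision:  # example default from this ticket
    """Classify a cache queue size against the two proposed limits (hypothetical helper)."""
    if limit_concurrent > limit_max:
        raise ValueError("limit_concurrent must not exceed limit_max")
    if queue_size > limit_max:
        # Second level: the hard limit is exceeded, so the job incompletes.
        return CacheDecision.INCOMPLETE
    if queue_size >= limit_concurrent:
        # First level: the soft limit is reached, so the job waits.
        return CacheDecision.WAIT
    return CacheDecision.PROCEED
```

With these assumed defaults, a queue size of 0 to 2 proceeds, 3 to 50 waits, and anything above 50 incompletes.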

Acceptance criteria

  • AC1: During normal operation on o3, including rebooting o3 workers and "busy days", no "cache service full" incompletes happen

Acceptance tests

  • AT1-1: Given no jobs running on a worker And clean cache When a job is scheduled Then the job is immediately picked up by the worker, cache service requests are executed, job starts
  • AT1-2: Given jobs running on a worker with assigned cache service requests exceeding "cache_queue_limit_concurrent" When a job is scheduled Then the job is not immediately picked up by the worker
  • AT1-3: Given jobs running on a worker with assigned cache service requests exceeding "cache_queue_limit_concurrent-1" When jobs are still forcefully assigned to the worker (simulating unfortunate timing, aka "race condition") Then the cache service request of the assigned job waits for a free slot
  • AT1-4: Given jobs running on a worker with assigned cache service requests When the requests exceed "cache_queue_limit_max" Then jobs incomplete as before with "Cache service queue already full" (likely rephrased a bit)
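The waiting behaviour expected by AT1-3 and AT1-4 might look roughly like the polling loop below. It is a minimal Python sketch under stated assumptions, not openQA code: `wait_for_cache_slot`, its parameters, the return strings, the polling interval, and the 3600 second stand-in for MAX_SETUP_TIME are all hypothetical, and the clock and sleep functions are injectable only so the loop can be exercised without real waiting.

```python
import time
from typing import Callable


def wait_for_cache_slot(get_queue_size: Callable[[], int],
                        limit_concurrent: int = 3,       # example default from this ticket
                        limit_max: int = 50,             # example default from this ticket
                        max_setup_time: float = 3600.0,  # assumed stand-in for MAX_SETUP_TIME
                        poll_interval: float = 5.0,      # assumed polling interval
                        clock: Callable[[], float] = time.monotonic,
                        sleep: Callable[[float], None] = time.sleep) -> str:
    """Wait until the cache queue has a free slot (hypothetical helper).

    Returns "ready" when the queue drops below limit_concurrent,
    "queue_full" when it exceeds limit_max, and "timeout" when
    max_setup_time elapses while still waiting.
    """
    deadline = clock() + max_setup_time
    while True:
        size = get_queue_size()
        if size > limit_max:
            return "queue_full"  # AT1-4: incomplete the job
        if size < limit_concurrent:
            return "ready"       # AT1-3: a slot became free
        if clock() >= deadline:
            return "timeout"     # setup exceeded the allowed time
        sleep(poll_interval)
```

For example, a queue that shrinks from 4 to 2 entries while polling yields "ready", whereas a queue that stays at 10 entries eventually yields "timeout".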

Suggestions

  • Review existing implementation of the check that is done before assigning jobs to workers
  • Ensure the defaults for both limits are far enough apart, e.g. "cache_queue_limit_concurrent" = 3, "cache_queue_limit_max" = 50
  • Execute ATs to see where implementation is missing

Related issues 1 (0 open, 1 closed)

Copied from openQA Project - action #128267: Restarting jobs (e.g. due to full cache queue) can lead to weird behavior for certain job dependencies (was: Ensure that the incomplete jobs with "cache service full" are properly restarted (take 2)) size:M (Resolved, mkittler, 2023-05-10)
