action #128267

closed

coordination #103944: [saga][epic] Scale up: More robust handling of diverse infrastructure with varying performance

coordination #98463: [epic] Avoid too slow asset downloads leading to jobs exceeding the timeout with or run into auto_review:"(timeout: setup exceeded MAX_SETUP_TIME|Cache service queue already full)":retry

Restarting jobs (e.g. due to full cache queue) can lead to weird behavior for certain job dependencies (was: Ensure that the incomplete jobs with "cache service full" are properly restarted (take 2)) size:M

Added by okurz about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Feature requests
Target version:
Start date:
Due date: 2023-05-10
% Done: 0%
Estimated time:

Description

Observation

Still a lot of "cache queue full" errors, reported in https://suse.slack.com/archives/C02CANHLANP/p1682406454494569 by dimstar:

(Dominique Leuenberger) Seems this kind of error is back (or more active again than it was in the last few weeks): https://openqa.opensuse.org/tests/3243495
Reason: asset failure: Failed to download opensuse-Tumbleweed-x86_64-20230424-textmode@64bit.qcow2 to /var/lib/openqa/cache/openqa1-opensuse/opensuse-Tumbleweed-x86_64-20230424-textmode@64bit.qcow2; I thought it was addressed? (At least it felt like it, as it did not appear for a while now. Might just have been lucky though.)
(Dominique Leuenberger) The start of the fail chain seems to be in https://openqa.opensuse.org/tests/3243518
Reason: cache failure: Cache service queue already full (5)
Cloned as 3243726
(the auto-clone not taking the children into account is known and unfixed)
(Fabian Vogt) This "Cache service queue already full" error is highly annoying
Every time a worker starts with a clear cache the first dozen tests fail with that
Maybe the queue just needs to be grown 10x or something...
(Dominique Leuenberger) ah, then the luck was probably that the snapshot moved to QA in the late evening, not early morning; so I happened to not be the first consumer

Acceptance criteria

  • AC1: Restarting one of two independent root jobs (only related indirectly via parallel dependency) is handled well (no job ends up as parallel_failed when it has no direct parallel dependencies, no chained children are executed without their parent being successful)
  • AC2: Restarting jobs (e.g. due to full cache queue) is generally handled well. So use cases similar to AC1 are also covered.
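
The two conditions in AC1 can be read as invariants over a job cluster after a restart. The following is a minimal sketch of such a check, not openQA code: the `Job` class, its fields, and the assumption that "successful" means a result of `passed` or `softfailed` are all hypothetical and only serve to illustrate the criteria.

```python
from dataclasses import dataclass, field
from typing import Optional

# Assumption: these results count as "parent being successful".
OK_RESULTS = {"passed", "softfailed"}


@dataclass
class Job:
    """Hypothetical, simplified model of a job and its dependencies."""
    id: int
    result: str                       # e.g. "passed", "incomplete", "parallel_failed"
    executed: bool = True             # False if the job was skipped/never started
    parallel_with: set = field(default_factory=set)  # ids of DIRECT parallel dependencies
    chained_parent: Optional[int] = None              # id of the chained parent, if any


def ac1_violations(jobs):
    """Return human-readable descriptions of AC1 violations in a list of jobs."""
    by_id = {job.id: job for job in jobs}
    problems = []
    for job in jobs:
        # Invariant 1: parallel_failed requires a direct parallel dependency.
        if job.result == "parallel_failed" and not job.parallel_with:
            problems.append(f"job {job.id}: parallel_failed without direct parallel dependency")
        # Invariant 2: a chained child must not run unless its parent succeeded.
        parent = by_id.get(job.chained_parent)
        if parent is not None and job.executed and parent.result not in OK_RESULTS:
            problems.append(f"job {job.id}: executed although chained parent {parent.id} is {parent.result}")
    return problems
```

A check like this could be run against the job data of a restarted cluster to see whether either invariant is broken; an empty list means neither condition of AC1 is violated under the assumptions above.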

Suggestions

  • Understand why #125276 could not fix the problem
  • Make sure jobs really restart if the cache service queue is full
  • Double- and triple-check jobs visible on https://openqa.opensuse.org
  • Get in touch with dimstar+fvogt to ensure the problem is fully addressed
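
As an aid for checking which incomplete jobs should have been restarted, the pattern from the parent epic's title can be applied to job reasons. This is only a sketch: the regular expression is copied from the `auto_review` expression quoted above, while the function name `should_retry` and the idea of matching it in Python are assumptions for illustration and do not describe how openQA evaluates `auto_review` internally.

```python
import re

# Pattern taken verbatim from the auto_review expression in the epic title.
RETRY_PATTERN = re.compile(
    r"(timeout: setup exceeded MAX_SETUP_TIME|Cache service queue already full)"
)


def should_retry(reason: str) -> bool:
    """Hypothetical helper: True if the job reason matches the retry pattern."""
    return RETRY_PATTERN.search(reason) is not None


# Reasons quoted in the observation above.
print(should_retry("cache failure: Cache service queue already full (5)"))
print(should_retry("asset failure: Failed to download opensuse-Tumbleweed-x86_64-20230424-textmode@64bit.qcow2"))
```

Note that the "asset failure" reason from the dependent job does not match the pattern, which is consistent with the observation that only the start of the fail chain carries the "queue already full" message.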

Related issues: 2 (0 open, 2 closed)

Copied from openQA Project - action #96684: Abort asset download via the cache service when related job runs into a timeout (or is otherwise cancelled) size:M (Rejected, mkittler, 2021-08-09)

Copied to openQA Project - action #128276: Handle workers with busy cache service gracefully by a two-level wait size:M (Resolved, mkittler, 2023-04-25)
