action #128267: Restarting jobs (e.g. due to full cache queue) can lead to weird behavior for certain job dependencies (was: Ensure that the incomplete jobs with "cache service full" are properly restarted (take 2)) size:M - openQA Project (public) - openSUSE Project Management Tool

action #128267

closed

coordination #103944: [saga][epic] Scale up: More robust handling of diverse infrastructure with varying performance

coordination #98463: [epic] Avoid too slow asset downloads leading to jobs exceeding the timeout with or run into auto_review:"(timeout: setup exceeded MAX_SETUP_TIME|Cache service queue already full)":retry

Restarting jobs (e.g. due to full cache queue) can lead to weird behavior for certain job dependencies (was: Ensure that the incomplete jobs with "cache service full" are properly restarted (take 2)) size:M

Added by okurz almost 2 years ago. Updated 9 months ago.

Status: Resolved

Priority: Urgent

Assignee: mkittler

Category: Feature requests

Target version: Ready

Tags: reactive work

Description

Observation

Still a lot of "cache queue full" errors, reported in https://suse.slack.com/archives/C02CANHLANP/p1682406454494569 by dimstar:

(Dominique Leuenberger) Seems this kind of error is back (or more active again than it used to be in the last few weeks): https://openqa.opensuse.org/tests/3243495
Reason: asset failure: Failed to download opensuse-Tumbleweed-x86_64-20230424-textmode@64bit.qcow2 to /var/lib/openqa/cache/openqa1-opensuse/opensuse-Tumbleweed-x86_64-20230424-textmode@64bit.qcow2; I thought it was addressed? (at least it felt like it, as it did not appear for a while now. Might just have been lucky though)
(Dominique Leuenberger) The start of the fail chain seems to be in https://openqa.opensuse.org/tests/3243518
Reason: cache failure: Cache service queue already full (5)
Cloned as 3243726
(the auto-clone not taking the children into account is known and unfixed)
(Fabian Vogt) This "Cache service queue already full" error is highly annoying
Every time a worker starts with a clear cache the first dozen tests fail with that
Maybe the queue just needs to be grown 10x or something...
(Dominique Leuenberger) ah, then the luck was probably that the snapshot moved to QA in the late evening, not early morning; so I happened to not be the first consumer

Acceptance criteria

AC1: Restarting one of two independent root jobs (only related indirectly via parallel dependency) is handled well (no job ends up as parallel_failed when it has no direct parallel dependencies, no chained children are executed without their parent being successful)
AC2: Restarting jobs (e.g. due to full cache queue) is generally handled well. So use cases similar to AC1 are also covered.
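
The two conditions named in AC1 can be written down as checks over a set of jobs. The following is only a minimal sketch in Python, not openQA code: the `Job` class, its fields, and the set of results treated as "successful" are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Assumption: these results count as "parent being successful".
OK_RESULTS = {"passed", "softfailed"}


@dataclass
class Job:
    """Hypothetical, simplified job record (not the openQA schema)."""
    name: str
    result: str                      # e.g. "passed", "failed", "parallel_failed"
    executed: bool = True            # whether the job actually ran
    parallel_with: Set[str] = field(default_factory=set)  # direct parallel dependencies
    chained_parent: Optional[str] = None                   # name of the chained parent, if any


def ac1_violations(jobs: List[Job]) -> List[str]:
    """Return a description of every job that breaks one of the AC1 conditions."""
    by_name = {job.name: job for job in jobs}
    problems = []
    for job in jobs:
        # Condition 1: parallel_failed requires at least one direct parallel dependency.
        if job.result == "parallel_failed" and not job.parallel_with:
            problems.append(f"{job.name}: parallel_failed without direct parallel dependency")
        # Condition 2: a chained child must not run unless its parent was successful.
        parent = by_name.get(job.chained_parent) if job.chained_parent else None
        if parent is not None and job.executed and parent.result not in OK_RESULTS:
            problems.append(f"{job.name}: executed although chained parent {parent.name} is {parent.result}")
    return problems
```

Running such a check over the jobs of a restarted cluster would list the jobs that ended up in one of the two unwanted states; an empty list means neither condition is violated.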

Suggestions

Understand why #125276 could not fix the problem
Make sure jobs really restart if the cache service queue is full
Double- and triple-check jobs visible on https://openqa.opensuse.org
Get in touch with dimstar+fvogt to ensure the problem is fully addressed
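
One way to double-check that affected jobs were really restarted is to match each job's reason against the auto_review expression from the parent epic and list the jobs that have no clone. This is a rough sketch in Python under stated assumptions: the job records are plain dictionaries with hypothetical keys `id`, `reason` and `clone_id`, and how they are fetched from an openQA instance is left out.

```python
import re

# Pattern copied from the auto_review expression in the title of coordination #98463.
RETRY_REASON = re.compile(
    r"(timeout: setup exceeded MAX_SETUP_TIME|Cache service queue already full)"
)


def unrestarted_retry_candidates(jobs):
    """Return the IDs of jobs whose reason matches the pattern but which have no clone.

    `jobs` is an iterable of dicts with the hypothetical keys
    "id", "reason" and "clone_id" (None if the job was not cloned).
    """
    missing = []
    for job in jobs:
        reason = job.get("reason") or ""
        if RETRY_REASON.search(reason) and job.get("clone_id") is None:
            missing.append(job["id"])
    return missing
```

For the jobs quoted in the observation, job 3243518 matches the pattern and has a clone (3243726), so it would not be reported. Job 3243495 failed with an asset failure, which the pattern does not cover.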

Files

Screenshot 2023-04-25 at 17-09-24 openQA opensuse-Tumbleweed-DVD-x86_64-Build20230424-create_hdd_gnome_encrypt_separate_boot@64bit test results.png (1.94 MB)

mkittler, 2023-04-25 15:10

Related issues 2 (0 open — 2 closed)
