action #28328

job was triggered trying to download HDD image but it's already gone

Added by okurz over 2 years ago. Updated about 1 month ago.

Status: Rejected
Start date: 24/11/2017
Priority: Normal
Due date:
Assignee: okurz
% Done: 0%
Category: Concrete Bugs
Target version: -
Difficulty:
Duration:

Description

Observation

https://openqa.suse.de/tests/1269748/file/autoinst-log.txt

start time: 2017-11-24 07:09:39
…
CACHE: Download of /var/lib/openqa/cache/SLES-15-aarch64-349.1@aarch64-minimal_with_sdk349.1_installed.qcow2 failed with: 404 - Not Found
+++ worker notes +++
end time: 2017-11-24 07:09:40
result: setup failure: Can't download SLES-15-aarch64-349.1@aarch64-minimal_with_sdk349.1_installed.qcow2

and from the parent:

end time: 2017-11-23 17:20:51
uploading install_and_reboot-y2logs.tar.bz2
uploading SLES-15-aarch64-349.1@aarch64-minimal_with_sdk349.1_installed.qcow2
Checksum comparison (actual:expected) 1032847561:1032847561 with size (actual:expected) 836435968:836435968
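The upload log above shows the "actual:expected" checksum and size comparison that confirmed the parent's upload was intact. As a rough illustration only (a hypothetical helper, not openQA's actual code, and assuming a CRC32-style integer checksum purely for the sketch):

```python
import zlib

def verify_upload(path, expected_checksum, expected_size):
    """Compare a file's CRC32 checksum and byte size against expected
    values, mirroring the "actual:expected" check in the worker log.
    Illustrative sketch; openQA's real verification may differ."""
    crc = 0
    size = 0
    with open(path, "rb") as f:
        # Stream in 1 MiB chunks so large qcow2 images are not read at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            crc = zlib.crc32(chunk, crc)
            size += len(chunk)
    return crc == expected_checksum and size == expected_size
```

Note that a matching checksum on upload only proves the asset arrived correctly; it says nothing about whether the file still exists by the time a child job asks for it, which is exactly the gap this ticket describes.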

osd:openqa-gru:

[Fri Nov 24 02:53:09 2017] [28989:info] GRU: removing /var/lib/openqa/share/factory/hdd/SLES-15-aarch64-349.1@aarch64-minimal_with_sdk349.1_installed.qcow2

So it was deleted as expected after the parent job uploaded it but before the downstream job had a chance to act on it.
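The worker-side failure path quoted in the Observation (a 404 from the asset download turning into a "setup failure") could be sketched roughly like this; the function name and error-string format are hypothetical, not openQA's actual cache service code:

```python
import urllib.request
import urllib.error

def download_asset(url, dest):
    """Download an asset into the worker cache. Returns None on success,
    or an error string (instead of raising) on failure, mirroring the
    'setup failure' path seen in the worker log. Hypothetical sketch."""
    try:
        with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
            # Stream in 1 MiB chunks; HDD images can be hundreds of MB.
            while chunk := resp.read(1 << 20):
                out.write(chunk)
    except urllib.error.URLError as e:
        # Covers HTTPError (e.g. 404 - Not Found) as well.
        return f"Download of {dest} failed with: {e}"
    return None
```

The point of the sketch: by the time this download runs, the asset may already have been removed server-side, so the 404 is not a transient error that retrying would fix.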

Problem

Shouldn't the asset be marked as "used" by the scheduled downstream job, to prevent GRU from cleaning it up?
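The guard the question implies could look roughly like this: before deleting, the cleanup checks whether any not-yet-finished job still references the asset. All names here are illustrative, not openQA's actual schema or code:

```python
# Hypothetical sketch of a cleanup guard: skip assets still referenced by
# jobs that have not finished yet. States and structures are illustrative.
PENDING_STATES = {"scheduled", "assigned", "running"}

def removable_assets(assets, jobs_by_asset):
    """Yield only those assets that no pending or running job references.

    assets: iterable of asset names
    jobs_by_asset: mapping of asset name -> list of {'state': ...} dicts
    """
    for asset in assets:
        jobs = jobs_by_asset.get(asset, [])
        if not any(job["state"] in PENDING_STATES for job in jobs):
            yield asset
```

As comment #1 below points out, such a guard does not by itself solve the problem: if the group quota is too small for the working set, something still has to be deleted.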


Related issues

Related to openQA Project - action #19672: GRU may delete assets while jobs are registered Resolved 08/06/2017
Related to openQA Project - action #16496: [tools][sprint 201711.2] display current disk space consu... Resolved 06/02/2017
Related to openQA Tests - action #25380: [sle][functional][epic] test fails in install - tries to ... Resolved 05/12/2017 27/02/2018
Related to openQA Project - action #34783: Don't let jobs incomplete if mandatory resources are missing In Progress 12/04/2018
Related to openQA Project - action #12180: [webui] Prevent tests to be triggered when required asset... New 31/05/2016
Blocks openQA Project - action #44885: Cache service hiccups - Assets are deleted after they are... Resolved 07/12/2018

History

#1 Updated by coolo over 2 years ago

Whether it's used or not doesn't matter: if GRU deletes it, the job group was obviously not big enough to hold the working set. You have 100 GB for that group, and the ISOs alone are around 70 GB.

But there is some subtle bug hidden, because SLES-15-x86_64-305.1-minimal_with_sdk305.1_installed.qcow2 is still present, but is 5 weeks old.

https://progress.opensuse.org/issues/19672#note-8 might be part of the puzzle
as is https://progress.opensuse.org/issues/16496 - it's just too hard at the moment to administer the job group sizes.

#2 Updated by coolo over 2 years ago

I increased the job group size of the functional group to 500 GB; 100 GB was a bit lightweight.

#3 Updated by okurz over 2 years ago

  • Related to action #19672: GRU may delete assets while jobs are registered added

#4 Updated by okurz over 2 years ago

  • Related to action #16496: [tools][sprint 201711.2] display current disk space consumption of job groups added

#5 Updated by okurz over 2 years ago

I guess I would be fine if the job had been canceled with a message/state/comment instead of ending up incomplete.

#6 Updated by SLindoMansilla over 2 years ago

  • Related to action #25380: [sle][functional][epic] test fails in install - tries to install SLE12 packages -> update test for sle15 added

#7 Updated by AdamWill over 2 years ago

I have thought for a while that the cleanup code should avoid deleting assets associated with pending/running jobs...

#8 Updated by okurz about 1 year ago

  • Related to action #34783: Don't let jobs incomplete if mandatory resources are missing added

#9 Updated by okurz 5 months ago

  • Related to action #12180: [webui] Prevent tests to be triggered when required assets are not present (anymore) added

#10 Updated by okurz 5 months ago

  • Blocks action #44885: Cache service hiccups - Assets are deleted after they are downloaded added

#11 Updated by okurz about 1 month ago

  • Status changed from New to Rejected
  • Assignee set to okurz

I guess by now we have changed the asset cleanup and quota management code enough again to consider the behaviour described in this ticket as by design. The alternative, locking assets for currently scheduled jobs even when the job group quota is exceeded, sounds dangerous as well. We could try to delete assets linked only to finished jobs first, but there are also good arguments for preferring to keep the assets of finished jobs, so I don't think we should make that call either.
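The first option mentioned above (preferring assets of finished jobs for deletion, oldest first) could be sketched as a sort order over deletion candidates. All names and fields here are illustrative, not openQA code:

```python
# Sketch of a deletion ordering for an over-quota group: assets referenced
# only by finished jobs go first (oldest first); assets still referenced
# by unfinished jobs are deleted last. Illustrative names only.
FINISHED_STATES = {"done", "cancelled"}

def deletion_order(assets):
    """Sort deletion candidates: finished-only assets first, oldest first.

    assets: list of dicts with 'job_states' (list of str) and 'age_days'.
    """
    def key(asset):
        has_unfinished = any(
            state not in FINISHED_STATES for state in asset["job_states"]
        )
        # Tuple sort: unfinished-referenced assets last; within each
        # bucket, larger age (older) sorts earlier via the negation.
        return (has_unfinished, -asset["age_days"])
    return sorted(assets, key=key)
```

As the comment notes, even this ordering is debatable, since finished jobs' assets may be exactly the ones worth keeping for reproducing results.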
