action #97304
Assets deleted even if there are still pending jobs size:M (closed)
Description
observation
I've recently observed multiple occurrences where the parent job (e.g. https://openqa.suse.de/tests/6859366) successfully creates an asset (e.g. hdd/SLES-15-SP2-x86_64-mru-install-minimal-with-addons-Build:20740:libesmtp-Server-DVD-Incidents-64bit.qcow2) but the chained children end up incomplete (e.g. https://openqa.suse.de/tests/6859372) because they can no longer download the asset: it has already been cleaned up on the web UI host, as can be seen in the logs:
[2021-08-19T18:10:34.0628 CEST] [debug] [pid:21356] Checking whether asset hdd/SLES-15-SP2-x86_64-mru-install-minimal-with-addons-Build:20740:libesmtp-Server-DVD-Incidents-64bit.qcow2 (2777677824) fits into group 306 (581430272)
[2021-08-19T18:15:59.0996 CEST] [debug] [pid:21356] {
assets => [
…
{
fixed => 0,
groups => { 306 => 6859366 },
id => 27413793,
max_job => 6859366,
name => "hdd/SLES-15-SP2-x86_64-mru-install-minimal-with-addons-Build:20740:libesmtp-Server-DVD-Incidents-64bit.qcow2",
parents => { 8 => 1 },
pending => 0,
picked_into => 0,
size => 2777677824,
t_created => "2021-08-19 15:35:50",
type => "hdd",
},
[2021-08-19T18:16:07.0773 CEST] [info] [pid:21356] Removing asset hdd/SLES-15-SP2-x86_64-mru-install-minimal-with-addons-Build:20740:libesmtp-Server-DVD-Incidents-64bit.qcow2 (belonging to job groups: 306 within parent job groups 8)
[2021-08-19T18:16:08.0067 CEST] [info] [pid:21356] GRU: removed /var/lib/openqa/share/factory/hdd/SLES-15-SP2-x86_64-mru-install-minimal-with-addons-Build:20740:libesmtp-Server-DVD-Incidents-64bit.qcow2
So the asset was deleted at 2021-08-19T18:16:08 CEST while the job using the asset was only started at 2021-08-19 21:51:43 CEST.
All jobs have the asset correctly listed in the job settings (HDD_1=SLES-15-SP2-x86_64-mru-install-minimal-with-addons-Build:20740:libesmtp-Server-DVD-Incidents-64bit.qcow2 in the child and PUBLISH_HDD_1 in the parent).
expected behavior
"Pending" assets are preserved. So all assets which are associated with a job that is not done
or cancelled are not subject to the asset cleanup (see the query sketch below).
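As a rough illustration of that criterion (purely a sketch, assuming the usual openQA tables jobs, assets and the jobs_assets join table; exact column names may differ), a query listing assets that should still count as pending could look like this:
-- sketch only: assumes jobs_assets links jobs and assets and that
-- jobs.state ends up as 'done' or 'cancelled' once a job is final
select distinct a.id, a.name, a.type
  from assets a
  join jobs_assets ja on ja.asset_id = a.id
  join jobs j on j.id = ja.job_id
 where j.state not in ('done', 'cancelled');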
further information
- We already have code which implements the expected behavior (in lib/OpenQA/Schema/ResultSet/Assets.pm) and there are also unit tests (int/14-grutasks.t) to verify that it works correctly. I've already extended those tests in the past (in https://github.com/os-autoinst/openQA/commit/22185d2d8f126990e8e1e4b6543d88f6bbc947ac) because we saw the same problem before (see #64544) but couldn't do more.
- It might be worth checking whether the implementation is correct, but given the previous point an implementation bug seems unlikely. Possibly the jobs were never correctly associated with the assets (despite the job settings being correct)?
- For later investigation I've stored the database dump of OSD from that time at storage.qa.suse.de:/storage/osd-archive/osd-dump-for-poo-97304-2021-08-19.dump.
Updated by mkittler about 3 years ago
- Assignee set to mkittler
It looks like we're only associating assets with a job when the asset is actually present:
sub register_assets_from_settings {
    …
    my $f_asset = _asset_find($name, $type, $parent_job_ids);
    unless (defined $f_asset) {
        # don't register asset not yet available
        delete $assets{$k};
        next;
    }
That of course explains the behavior. I'm wondering why the code was written that way, e.g. what would break if we just got rid of this check.
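To spot jobs affected by this (again just a sketch, assuming openQA's job_settings, jobs_assets and assets tables; whether assets.name carries the "hdd/" prefix may differ), one could look for jobs that have an HDD_1 setting but no matching hdd asset association:
-- sketch only: lists jobs whose HDD_1 setting names an hdd asset that was
-- never associated with the job, e.g. because the file was not present yet
select js.job_id, js.value as hdd_1
  from job_settings js
 where js.key = 'HDD_1'
   and not exists (
         select 1
           from jobs_assets ja
           join assets a on a.id = ja.asset_id
          where ja.job_id = js.job_id
            and a.type = 'hdd'
            and a.name like '%' || js.value);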
Updated by mkittler about 3 years ago
- Status changed from New to In Progress
Draft for testing it out: https://github.com/os-autoinst/openQA/pull/4136
Updated by okurz about 3 years ago
- Subject changed from Assets deleted even if there are still pending jobs to Assets deleted even if there are still pending jobs size:M
- Target version set to Ready
Updated by openqa_review about 3 years ago
- Due date set to 2021-09-07
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler about 3 years ago
I'm wondering why the code was written that way, e.g. what would break if we just got rid of this check.
There were in fact two reasons why it was written that way:
- The detection of missing assets so far simply assumes that any registered asset also exists.
- The name of private assets is not known when registering assets for child jobs.
So https://github.com/os-autoinst/openQA/pull/4136 turned out a little bigger after all, but I hope all problems are dealt with in the end (see the commit messages).
Updated by mkittler about 3 years ago
- Status changed from In Progress to Feedback
The PR has been merged, let's see whether it works in production.
The following query returns incompletes due to missing qcow2 downloads which have a parent job that would publish a qcow2 image:
select id, parent_job_id, t_finished, result,
       (select host from workers where id = assigned_worker_id) as worker
  from jobs
  right join job_dependencies on jobs.id = job_dependencies.child_job_id
 where (select count(id) from job_settings
         where job_id = parent_job_id
           and key like '%PUBLISH_HDD%'
           and value like '%.qcow2%'
         limit 1) >= 1
   and reason like '%asset failure: Failed to download%.qcow2%'
   and t_finished >= '2021-08-18T00:00:00'
 order by t_finished;
It returns 175 jobs for the given time stamp (roughly the last week), including the job from the ticket description. I'll execute the query again after the PR has been deployed to see whether the situation improves.
Updated by livdywan about 3 years ago
- Related to coordination #64881: [epic] Reconsider triggering cleanup jobs added
Updated by mkittler about 3 years ago
- Related to deleted (coordination #64881: [epic] Reconsider triggering cleanup jobs)
Updated by mkittler about 3 years ago
Looks like qcow2 images are now correctly considered pending:
{
fixed => 0,
groups => { 218 => 6974163, 354 => 6974164 },
id => 27427318,
max_job => 6974164,
name => "hdd/SLES-12-SP4-s390x-mru-install-minimal-with-addons-Build20210831-1-Server-DVD-Updates-s390x-kvm-sle12.qcow2",
parents => { 7 => 1, 35 => 1 },
pending => 1,
picked_into => 218,
size => undef,
t_created => "2021-08-31 01:11:23",
type => "hdd",
},
I actually found this job via the query from my previous comment. There was indeed a job which failed to download this asset, but only because https://openqa.suse.de/tests/6972839 really didn't create/upload that asset in the first place. The same goes for other jobs I found via that query, e.g. https://openqa.suse.de/tests/6972850. All of these jobs are PowerPC jobs; this happened because of the bad needle svirt-asset-upload-hdd-image-uploaded-20210831, which has been deleted now. However, I couldn't find any assets missing due to the cleanup. So as far as this ticket is concerned everything looks good at the moment.
Updated by mkittler about 3 years ago
- Status changed from Feedback to Resolved
select id, parent_job_id, t_finished, result,
       (select host from workers where id = assigned_worker_id) as worker
  from jobs
  right join job_dependencies on jobs.id = job_dependencies.child_job_id
 where (select count(id) from job_settings
         where job_id = parent_job_id
           and key like '%PUBLISH_HDD%'
           and value like '%.qcow2%'
         limit 1) >= 1
   and reason like '%asset failure: Failed to download%.qcow2%'
   and t_finished >= '2021-08-31T12:00:00'
 order by t_finished;
shows only three jobs and none of the failing asset downloads is caused by the asset being removed by the cleanup too early. So I'm considering the issue resolved.
(It looks like the parent job didn't actually create the asset in the first two cases. The third job was restarted after the asset had already been removed, without restarting the parent to re-create the asset.)
Updated by okurz about 3 years ago
- Related to action #98388: Non-existing asset "uefi-vars" is still shown up on #downloads added