action #97304: Assets deleted even if there are still pending jobs size:M - openQA Project (public) - openSUSE Project Management Tool

action #97304

closed

Assets deleted even if there are still pending jobs size:M

Added by mkittler over 3 years ago. Updated over 3 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

mkittler

Category:

Regressions/Crashes

Target version:

Ready

Start date:

2021-08-20

Due date:

2021-09-07

Description

observation¶

I've recently observed multiple occurrences where the parent job (e.g. https://openqa.suse.de/tests/6859366) successfully creates an asset (e.g. hdd/SLES-15-SP2-x86_64-mru-install-minimal-with-addons-Build:20740:libesmtp-Server-DVD-Incidents-64bit.qcow2) but the chained children end up incomplete (e.g. https://openqa.suse.de/tests/6859372). They can no longer download the asset because it has already been cleaned up on the web UI host, as can be seen in the logs:

[2021-08-19T18:10:34.0628 CEST] [debug] [pid:21356] Checking whether asset hdd/SLES-15-SP2-x86_64-mru-install-minimal-with-addons-Build:20740:libesmtp-Server-DVD-Incidents-64bit.qcow2 (2777677824) fits into group 306 (581430272)
[2021-08-19T18:15:59.0996 CEST] [debug] [pid:21356] {
  assets  => [
…
               {
                 fixed       => 0,
                 groups      => { 306 => 6859366 },
                 id          => 27413793,
                 max_job     => 6859366,
                 name        => "hdd/SLES-15-SP2-x86_64-mru-install-minimal-with-addons-Build:20740:libesmtp-Server-DVD-Incidents-64bit.qcow2",
                 parents     => { 8 => 1 },
                 pending     => 0,
                 picked_into => 0,
                 size        => 2777677824,
                 t_created   => "2021-08-19 15:35:50",
                 type        => "hdd",
               },
[2021-08-19T18:16:07.0773 CEST] [info] [pid:21356] Removing asset hdd/SLES-15-SP2-x86_64-mru-install-minimal-with-addons-Build:20740:libesmtp-Server-DVD-Incidents-64bit.qcow2 (belonging to job groups: 306 within parent job groups 8)
[2021-08-19T18:16:08.0067 CEST] [info] [pid:21356] GRU: removed /var/lib/openqa/share/factory/hdd/SLES-15-SP2-x86_64-mru-install-minimal-with-addons-Build:20740:libesmtp-Server-DVD-Incidents-64bit.qcow2

So the asset was deleted at 2021-08-19T18:16:08 CEST, whereas the job using the asset was only started at 2021-08-19 21:51:43 CEST.
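To make the gap concrete, the two time stamps can be subtracted. This is only a minimal sketch: it assumes both values are in the same time zone (CEST, as stated), so naive datetimes are sufficient, and the variable names are purely illustrative.

```python
from datetime import datetime

# Time stamps taken from the log excerpt and the job mentioned above (both CEST).
asset_removed = datetime.fromisoformat("2021-08-19T18:16:08")
child_started = datetime.fromisoformat("2021-08-19 21:51:43")

# Time between the removal of the asset and the start of the job needing it.
gap = child_started - asset_removed
print(gap)
```

So the child job started more than three and a half hours after its asset was already gone.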

All jobs have the asset correctly listed in the job settings (HDD_1=SLES-15-SP2-x86_64-mru-install-minimal-with-addons-Build:20740:libesmtp-Server-DVD-Incidents-64bit.qcow2 in the child and PUBLISH_HDD_1 in the parent).

expected behavior¶

"Pending" assets are preserved. So all assets which are associated with a job that is not done or cancelled are not subject to the asset cleanup.
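The rule can be sketched roughly as follows. This is not the openQA implementation (which is written in Perl); the classes, the helper names and the asset names are hypothetical, and the set of final states is an assumption derived from the sentence above.

```python
from dataclasses import dataclass, field

# Assumption: jobs in these states no longer need their assets.
FINAL_STATES = {"done", "cancelled"}

@dataclass
class Job:
    id: int
    state: str

@dataclass
class Asset:
    name: str
    jobs: list = field(default_factory=list)  # jobs associated with the asset

def is_pending(asset):
    """An asset is pending while any associated job is not in a final state."""
    return any(job.state not in FINAL_STATES for job in asset.jobs)

def removable(assets):
    """Only non-pending assets may be considered by the cleanup."""
    return [a for a in assets if not is_pending(a)]

parent = Job(6859366, "done")
child = Job(6859372, "scheduled")
published = Asset("hdd/example.qcow2", [parent, child])  # hypothetical name
stale = Asset("hdd/old.qcow2", [parent])                 # hypothetical name
print([a.name for a in removable([published, stale])])
```

Note that the rule only helps if the child job is actually associated with the asset; an asset whose only associated job is the finished parent is not pending.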

further information¶

We already have code which implements the expected behavior (in lib/OpenQA/Schema/ResultSet/Assets.pm) and there are also unit tests (in t/14-grutasks.t) to verify that it works correctly. I've already extended those tests (in https://github.com/os-autoinst/openQA/commit/22185d2d8f126990e8e1e4b6543d88f6bbc947ac) because we saw the same problem before (see #64544) but couldn't do more.
It might be worth checking whether the implementation is correct, but given the previous point a bug there is unlikely. Possibly the jobs were never correctly associated with the assets (despite the job settings being correct)?
For later investigation I've stored the database dump of OSD from that time on storage.qa.suse.de:/storage/osd-archive/osd-dump-for-poo-97304-2021-08-19.dump.

Related issues 1 (0 open — 1 closed)

Updated by mkittler over 3 years ago

Description updated (diff)

Updated by mkittler over 3 years ago

Assignee set to mkittler

It looks like we're only associating assets with a job when the asset is actually present:

sub register_assets_from_settings {
…
        my $f_asset = _asset_find($name, $type, $parent_job_ids);
        unless (defined $f_asset) {
            # don't register asset not yet available
            delete $assets{$k};
            next;
        }

That of course explains the behavior. I'm wondering why the code was written that way, e.g. what would break if we would just get rid of this check.
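The effect of that check can be illustrated with a small model. This is a hedged sketch only, not the actual openQA code: the function names, the job IDs 1 and 2 and the asset name are made up, and the model assumes the child job is created before the parent has published the asset.

```python
def register_assets(job_id, wanted, existing, associations, skip_missing):
    """Associate job_id with each wanted asset name.

    skip_missing=True mimics the check quoted above: assets that do not
    exist yet are silently dropped instead of being registered.
    """
    for name in wanted:
        if skip_missing and name not in existing:
            continue  # "don't register asset not yet available"
        associations.setdefault(name, set()).add(job_id)

def pending_assets(associations, unfinished_jobs):
    """Assets that still have at least one associated unfinished job."""
    return {name for name, jobs in associations.items() if jobs & unfinished_jobs}

def simulate(skip_missing):
    """Return whether the asset counts as pending while the child still waits."""
    asset = "hdd/example.qcow2"  # hypothetical asset name
    associations = {}
    # The child (job 2) is created while the asset does not exist yet.
    register_assets(2, [asset], set(), associations, skip_missing)
    # The parent (job 1) publishes the asset later and is associated with it.
    register_assets(1, [asset], {asset}, associations, skip_missing)
    # The parent is done by now; only the child is still unfinished.
    return asset in pending_assets(associations, unfinished_jobs={2})

print(simulate(skip_missing=True), simulate(skip_missing=False))
```

With the check in place the child never gets associated with the asset, so the asset is not considered pending once the parent has finished.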

Updated by mkittler over 3 years ago

Status changed from New to In Progress

Draft for testing it out: https://github.com/os-autoinst/openQA/pull/4136

Updated by okurz over 3 years ago

Subject changed from Assets deleted even if there are still pending jobs to Assets deleted even if there are still pending jobs size:M
Target version set to Ready

Updated by openqa_review over 3 years ago

Due date set to 2021-09-07

Setting due date based on mean cycle time of SUSE QE Tools

Updated by mkittler over 3 years ago

I'm wondering why the code was written that way, e.g. what would break if we would just get rid of this check.

There were in fact two reasons why it was written that way:

The detection of missing assets so far simply assumes that any registered asset also exists.
The name of private assets is not known when registering assets for child jobs.

So https://github.com/os-autoinst/openQA/pull/4136 turned out a little bigger after all, but I hope that in the end all problems are dealt with (see commit messages).

Updated by mkittler over 3 years ago

Status changed from In Progress to Feedback

The PR has been merged; let's see whether it works in production.

The following query returns incompletes due to missing qcow2 downloads which have a parent job that would publish a qcow2 image:

select id, parent_job_id, t_finished, result,
       (select host from workers where id = assigned_worker_id) as worker
from jobs
right join job_dependencies on jobs.id = job_dependencies.child_job_id
where (select count(id) from job_settings
       where job_id = parent_job_id
         and key like '%PUBLISH_HDD%'
         and value like '%.qcow2%'
       limit 1) >= 1
  and reason like '%asset failure: Failed to download%.qcow2%'
  and t_finished >= '2021-08-18T00:00:00'
order by t_finished;

It returns 175 jobs for the given time stamp (roughly the last week), including the job from the ticket description. I'll execute the query again after the PR has been deployed to see whether the situation improves.
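The logic of the query can be tried out on toy data. The following is only a sketch under several assumptions: it uses an in-memory SQLite database instead of the production PostgreSQL, a heavily reduced schema containing just the columns the query touches (without the workers table), made-up rows, an inner join instead of the right join and `exists` instead of the `count(id) ... >= 1` subquery.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table jobs (id integer primary key, t_finished text, result text, reason text);
    create table job_dependencies (child_job_id integer, parent_job_id integer);
    create table job_settings (job_id integer, key text, value text);
""")
# Made-up rows: job 1 is a parent publishing a qcow2 image, job 2 a child that
# failed to download it and job 3 a child that failed for an unrelated reason.
con.executemany("insert into jobs values (?, ?, ?, ?)", [
    (1, "2021-08-19T15:40:00", "passed", None),
    (2, "2021-08-19T21:52:00", "incomplete",
     "asset failure: Failed to download example.qcow2 to /hypothetical/cache"),
    (3, "2021-08-19T22:00:00", "incomplete", "backend died"),
])
con.executemany("insert into job_dependencies values (?, ?)", [(2, 1), (3, 1)])
con.execute("insert into job_settings values (1, 'PUBLISH_HDD_1', 'example.qcow2')")

query = """
    select jobs.id, dep.parent_job_id
    from jobs
    join job_dependencies as dep on jobs.id = dep.child_job_id
    where exists (select 1 from job_settings
                  where job_id = dep.parent_job_id
                    and key like '%PUBLISH_HDD%' and value like '%.qcow2%')
      and jobs.reason like '%asset failure: Failed to download%.qcow2%'
      and jobs.t_finished >= ?
    order by jobs.t_finished
"""
rows = con.execute(query, ("2021-08-18T00:00:00",)).fetchall()
print(rows)
```

Only the child with the download failure is returned, together with the ID of its publishing parent. As the later comments show, such a match does not yet prove that the cleanup was the cause; the parent may simply not have created the asset.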

Updated by livdywan over 3 years ago

Related to coordination #64881: [epic] Reconsider triggering cleanup jobs added

Updated by mkittler over 3 years ago

Related to deleted (coordination #64881: [epic] Reconsider triggering cleanup jobs)

#10

Updated by mkittler over 3 years ago

Looks like qcow2 images are now correctly considered pending:

               {
                 fixed       => 0,
                 groups      => { 218 => 6974163, 354 => 6974164 },
                 id          => 27427318,
                 max_job     => 6974164,
                 name        => "hdd/SLES-12-SP4-s390x-mru-install-minimal-with-addons-Build20210831-1-Server-DVD-Updates-s390x-kvm-sle12.qcow2",
                 parents     => { 7 => 1, 35 => 1 },
                 pending     => 1,
                 picked_into => 218,
                 size        => undef,
                 t_created   => "2021-08-31 01:11:23",
                 type        => "hdd",
               },

I actually found this job via the query from my previous comment. There really was a job which failed to download this asset, but only because https://openqa.suse.de/tests/6972839 didn't create/upload that asset in the first place. The same goes for other jobs I found via that query, e.g. https://openqa.suse.de/tests/6972850. All of these jobs are PowerPC jobs; it happened because of the bad needle svirt-asset-upload-hdd-image-uploaded-20210831, which is deleted now. However, I couldn't find any assets missing due to the cleanup. So as far as the ticket is concerned everything looks good at the moment.

#11

Updated by mkittler over 3 years ago

Status changed from Feedback to Resolved

select id, parent_job_id, t_finished, result,
       (select host from workers where id = assigned_worker_id) as worker
from jobs
right join job_dependencies on jobs.id = job_dependencies.child_job_id
where (select count(id) from job_settings
       where job_id = parent_job_id
         and key like '%PUBLISH_HDD%'
         and value like '%.qcow2%'
       limit 1) >= 1
  and reason like '%asset failure: Failed to download%.qcow2%'
  and t_finished >= '2021-08-31T12:00:00'
order by t_finished;

This query shows only three jobs, and none of the failing asset downloads is caused by the cleanup removing the asset too early. So I'm considering the issue resolved.

(It looks like the parent job didn't actually create the asset for the first two jobs. The third job was restarted after the asset had already been removed, without restarting the parent to re-create the asset.)

#12

Updated by okurz over 3 years ago

Related to action #98388: Non-existing asset "uefi-vars" is still shown up on #downloads added
