action #97304


Assets deleted even if there are still pending jobs size:M

Added by mkittler over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2021-08-20
Due date: 2021-09-07
% Done: 0%
Estimated time:

Description

observation

I've recently observed multiple occurrences where the parent job (e.g. https://openqa.suse.de/tests/6859366) successfully creates an asset (e.g. hdd/SLES-15-SP2-x86_64-mru-install-minimal-with-addons-Build:20740:libesmtp-Server-DVD-Incidents-64bit.qcow2) but the chained children incomplete (e.g. https://openqa.suse.de/tests/6859372) because they can no longer download the asset: it has already been cleaned up on the web UI host, as can be seen in the logs:

[2021-08-19T18:10:34.0628 CEST] [debug] [pid:21356] Checking whether asset hdd/SLES-15-SP2-x86_64-mru-install-minimal-with-addons-Build:20740:libesmtp-Server-DVD-Incidents-64bit.qcow2 (2777677824) fits into group 306 (581430272)
[2021-08-19T18:15:59.0996 CEST] [debug] [pid:21356] {
  assets  => [
…
               {
                 fixed       => 0,
                 groups      => { 306 => 6859366 },
                 id          => 27413793,
                 max_job     => 6859366,
                 name        => "hdd/SLES-15-SP2-x86_64-mru-install-minimal-with-addons-Build:20740:libesmtp-Server-DVD-Incidents-64bit.qcow2",
                 parents     => { 8 => 1 },
                 pending     => 0,
                 picked_into => 0,
                 size        => 2777677824,
                 t_created   => "2021-08-19 15:35:50",
                 type        => "hdd",
               },
[2021-08-19T18:16:07.0773 CEST] [info] [pid:21356] Removing asset hdd/SLES-15-SP2-x86_64-mru-install-minimal-with-addons-Build:20740:libesmtp-Server-DVD-Incidents-64bit.qcow2 (belonging to job groups: 306 within parent job groups 8)
[2021-08-19T18:16:08.0067 CEST] [info] [pid:21356] GRU: removed /var/lib/openqa/share/factory/hdd/SLES-15-SP2-x86_64-mru-install-minimal-with-addons-Build:20740:libesmtp-Server-DVD-Incidents-64bit.qcow2

So the asset was deleted at 2021-08-19T18:16:08 CEST while the job using the asset was only started at 2021-08-19 21:51:43 CEST.

All jobs have the asset correctly listed in the job settings (HDD_1=SLES-15-SP2-x86_64-mru-install-minimal-with-addons-Build:20740:libesmtp-Server-DVD-Incidents-64bit.qcow2 in the child and PUBLISH_HDD_1 in the parent).

expected behavior

"Pending" assets are preserved. That is, all assets which are associated with a job that is not yet done or cancelled are not subject to the asset cleanup.
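As an illustrative sketch of this rule (not the actual openQA Perl implementation, which lives in lib/OpenQA/Schema/ResultSet/Assets.pm), the cleanup should only ever consider assets that are neither fixed nor pending; all names and data below are hypothetical:

```python
# Sketch of the intended cleanup rule: an asset with pending => 1 (some job
# using it is not yet done/cancelled) must never be a removal candidate.
# Asset dicts mimic the shape of the log dump above; names are made up.

def removal_candidates(assets):
    """Return assets the cleanup may delete: neither fixed nor pending."""
    return [a for a in assets if not a["fixed"] and not a["pending"]]

assets = [
    # deletable: no scheduled/running job needs it anymore
    {"name": "hdd/example-published.qcow2", "fixed": 0, "pending": 0},
    # must be preserved: a not-yet-done job still references it
    {"name": "hdd/example-still-needed.qcow2", "fixed": 0, "pending": 1},
]
print([a["name"] for a in removal_candidates(assets)])
# ['hdd/example-published.qcow2']
```

In the log excerpt above the asset was shown with `pending => 0`, so the cleanup considered it fair game; the bug is therefore in how the pending flag is computed, not in the rule itself.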

further information

  1. We already have code which implements the expected behavior (in lib/OpenQA/Schema/ResultSet/Assets.pm), and there are also unit tests (in t/14-grutasks.t) to verify that it works correctly. I've already extended those tests in the past (in https://github.com/os-autoinst/openQA/commit/22185d2d8f126990e8e1e4b6543d88f6bbc947ac) because we saw the same problem before (see #64544) but couldn't do more.
  2. It might be worth checking whether the implementation is correct, but given the previous point that's unlikely. Possibly the jobs were never correctly associated with the assets (despite the job settings being correct)?
  3. For later investigation I've stored the database dump of OSD from that time on storage.qa.suse.de:/storage/osd-archive/osd-dump-for-poo-97304-2021-08-19.dump.

Related issues: 1 (0 open, 1 closed)

Related to openQA Project (public) - action #98388: Non-existing asset "uefi-vars" is still shown up on #downloads (Resolved, mkittler, 2021-09-09)

Actions #1

Updated by mkittler over 3 years ago

  • Description updated (diff)
Actions #2

Updated by mkittler over 3 years ago

  • Assignee set to mkittler

It looks like we're only associating assets with a job when the asset is actually present:

sub register_assets_from_settings {
…
        my $f_asset = _asset_find($name, $type, $parent_job_ids);
        unless (defined $f_asset) {
            # don't register asset not yet available
            delete $assets{$k};
            next;
        }

That of course explains the behavior. I'm wondering why the code was written that way, e.g. what would break if we just got rid of this check.
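To make the consequence of that early return explicit, here is a hypothetical model (illustrative only; the real code is the Perl excerpt above): if the asset file does not exist yet when the child job's settings are registered, the job is never associated with the asset, so the cleanup later sees no pending user for it. All names below are made up:

```python
# Hypothetical model of the registration behavior: an asset is only
# associated with a job when the file already exists on disk.

def register_assets(job_id, asset_names, existing_files, associations):
    """Associate job_id with each asset name, but skip missing files."""
    for name in asset_names:
        if name not in existing_files:
            # mirrors the "don't register asset not yet available" branch:
            # the scheduled child job is silently not associated
            continue
        associations.setdefault(name, set()).add(job_id)

associations = {}
# child job is scheduled before the parent has uploaded the qcow2 image:
register_assets(6859372, ["hdd/example.qcow2"], existing_files=set(),
                associations=associations)
print(associations)  # {} -> cleanup sees no job needing the asset
```

Under this model the pending flag for the asset can never become 1 for jobs scheduled before the upload, which matches the `pending => 0` seen in the log dump.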

Actions #3

Updated by mkittler over 3 years ago

  • Status changed from New to In Progress
Actions #4

Updated by okurz over 3 years ago

  • Subject changed from Assets deleted even if there are still pending jobs to Assets deleted even if there are still pending jobs size:M
  • Target version set to Ready
Actions #5

Updated by openqa_review over 3 years ago

  • Due date set to 2021-09-07

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by mkittler over 3 years ago

I'm wondering why the code was written that way, e.g. what would break if we just got rid of this check.

There were in fact two reasons why it was written that way:

  1. The detection of missing assets so far simply assumed that any registered asset also exists.
  2. The name of a private asset is not known when registering assets for child jobs.

So https://github.com/os-autoinst/openQA/pull/4136 turned out a little bigger after all, but I hope that in the end all problems are dealt with (see the commit messages).

Actions #7

Updated by mkittler over 3 years ago

  • Status changed from In Progress to Feedback

The PR has been merged, let's see whether it works in production.

The following query returns incompletes due to missing qcow2 downloads which have a parent job that would publish a qcow2 image:

select id, parent_job_id, t_finished, result,
       (select host from workers where id = assigned_worker_id) as worker
  from jobs
  right join job_dependencies on jobs.id = job_dependencies.child_job_id
 where (select count(id) from job_settings
         where job_id = parent_job_id
           and key like '%PUBLISH_HDD%'
           and value like '%.qcow2%' limit 1) >= 1
   and reason like '%asset failure: Failed to download%.qcow2%'
   and t_finished >= '2021-08-18T00:00:00'
 order by t_finished;

It returns 175 jobs for the given time frame (roughly the last week), including the job from the ticket description. I'll execute the query again after the PR has been deployed to see whether the situation improves.
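For readers less comfortable with SQL, the intent of the query can be rendered roughly in Python over in-memory rows (hypothetical data structures, not the openQA schema access layer): select child jobs whose incompletion reason mentions a failed qcow2 download and whose parent publishes a qcow2 image.

```python
# Rough, illustrative Python equivalent of the SQL query above.

def suspicious_incompletes(jobs, settings, since):
    """IDs of jobs that failed a qcow2 download although the parent
    job would publish a qcow2 image, finished on/after `since`."""
    out = []
    for job in jobs:
        parent = job.get("parent_job_id")
        parent_publishes_qcow = any(
            s["job_id"] == parent
            and "PUBLISH_HDD" in s["key"]
            and ".qcow2" in s["value"]
            for s in settings)
        reason = job.get("reason") or ""
        if (parent_publishes_qcow
                and "asset failure: Failed to download" in reason
                and ".qcow2" in reason
                and job["t_finished"] >= since):
            out.append(job["id"])
    return sorted(out)

jobs = [{"id": 6859372, "parent_job_id": 6859366, "t_finished": "2021-08-19",
         "reason": "asset failure: Failed to download example.qcow2"}]
settings = [{"job_id": 6859366, "key": "PUBLISH_HDD_1",
             "value": "example.qcow2"}]
print(suspicious_incompletes(jobs, settings, "2021-08-18"))  # [6859372]
```

Note this matches any such incomplete, whether the asset was deleted by the cleanup or simply never uploaded; distinguishing the two still requires looking at the parent job's logs, as done in the following comments.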

Actions #8

Updated by livdywan over 3 years ago

Actions #9

Updated by mkittler over 3 years ago

Actions #10

Updated by mkittler over 3 years ago

Looks like qcow2 images are now correctly considered pending:

               {
                 fixed       => 0,
                 groups      => { 218 => 6974163, 354 => 6974164 },
                 id          => 27427318,
                 max_job     => 6974164,
                 name        => "hdd/SLES-12-SP4-s390x-mru-install-minimal-with-addons-Build20210831-1-Server-DVD-Updates-s390x-kvm-sle12.qcow2",
                 parents     => { 7 => 1, 35 => 1 },
                 pending     => 1,
                 picked_into => 218,
                 size        => undef,
                 t_created   => "2021-08-31 01:11:23",
                 type        => "hdd",
               },

I actually found this job via the query from my previous comment. There really was a job which failed to download this asset, but only because https://openqa.suse.de/tests/6972839 didn't create/upload the asset in the first place. The same goes for other jobs I found via that query, e.g. https://openqa.suse.de/tests/6972850. All of these jobs are PowerPC jobs; this happened because of the bad needle svirt-asset-upload-hdd-image-uploaded-20210831, which has been deleted now. However, I couldn't find any assets missing due to the cleanup. So as far as this ticket is concerned, everything looks good at the moment.

Actions #11

Updated by mkittler over 3 years ago

  • Status changed from Feedback to Resolved

Running the same query as before, now with t_finished >= '2021-08-31T12:00:00':

select id, parent_job_id, t_finished, result,
       (select host from workers where id = assigned_worker_id) as worker
  from jobs
  right join job_dependencies on jobs.id = job_dependencies.child_job_id
 where (select count(id) from job_settings
         where job_id = parent_job_id
           and key like '%PUBLISH_HDD%'
           and value like '%.qcow2%' limit 1) >= 1
   and reason like '%asset failure: Failed to download%.qcow2%'
   and t_finished >= '2021-08-31T12:00:00'
 order by t_finished;

shows only three jobs, and none of the failing asset downloads was caused by the asset being removed too early by the cleanup. So I'm considering the issue resolved.

(It looks like the parent job didn't actually create the asset in the first two cases. The third job was restarted after the asset had already been removed, without restarting the parent to re-create the asset.)

Actions #12

Updated by okurz about 3 years ago

  • Related to action #98388: Non-existing asset "uefi-vars" is still shown up on #downloads added