action #41483

[tools] medium that should belong to job group was deleted after just 2 minutes making our SLE15 tests in build 47.1 useless ... and more

Added by okurz over 1 year ago. Updated over 1 year ago.

Status:ResolvedStart date:24/09/2018
Priority:UrgentDue date:
Assignee:coolo% Done:

0%

Category:Concrete Bugs
Target version:Done
Difficulty:hard
Duration:

Description

Observation

/var/log/openqa.1.xz:[2018-09-23T16:49:13.0743 CEST] [info] Registering asset iso/SLE-15-SP1-Installer-DVD-x86_64-Build47.1-Media1.iso
/var/log/openqa.1.xz:[2018-09-23T16:51:50.0297 CEST] [warn] Asset iso/SLE-15-SP1-Installer-DVD-x86_64-Build47.1-Media1.iso is not in any job group, will delete in 2 days
/var/log/openqa.1.xz:[2018-09-23T16:52:26.0339 CEST] [info] GRU: removing /var/lib/openqa/share/factory/iso/SLE-15-SP1-Installer-DVD-x86_64-Build47.1-Media1.iso

-> [24 Sep 2018 09:07:38] <coolo> 2 days became minutes quickly :)

Problem

This looks like introduced by recent changes to track non-job-group assets better.


Related issues

Related to openQA Project - action #41543: parallel_failed jobs miss t_finished Resolved 25/09/2018
Related to openQA Project - action #45041: Asset cache cleanup still fails if new jobs are created a... Resolved 12/12/2018

History

#1 Updated by okurz over 1 year ago

  • Description updated (diff)

#2 Updated by okurz over 1 year ago

Seems there had been a deployment on 2018-09-23 bringing in recent changes to handle assets which are not in a group better?

#3 Updated by coolo over 1 year ago

  • Assignee set to coolo

I don't think, that the 'groupless' tracking is faulty here - that part said 'in 2 days'. It rather looks to me like the ISO was accounted to a very tiny group in the 2nd round - the ISO wasn't really done scheduling :(

But we lack that infos.

#4 Updated by mkittler over 1 year ago

  • Assignee deleted (coolo)

Asset iso/SLE-15-SP1-Installer-DVD-x86_64-Build47.1-Media1.iso is not in any job group, will delete in 2 days

Apparently there was really no group assigned at this point or the code for untracked assets is faulty, indeed. I've just looked at the code again but can't see any obvious mistake.

GRU: removing /var/lib/openqa/share/factory/iso/SLE-15-SP1-Installer-DVD-x86_64-Build47.1-Media1.iso

We don't know why the asset has been removed. I suspect it wasn't group-less anymore at this point so the 2 days didn't apply anymore. Like coolo said it could have been accounted to a very tiny group.

So we should likely log why an asset has been deleted and maybe it makes sense not to delete anything while scheduling an ISO.

#5 Updated by mkittler over 1 year ago

  • Assignee set to coolo

#6 Updated by coolo over 1 year ago

We now have more debug - and as expeccted: the 'no group' handling is not the problem:

[2018-09-24T16:25:51.0466 CEST] [info] Registering asset iso/SLE-12-SP4-Server-DVD-aarch64-Build0395-Media1.iso
[2018-09-24T16:28:56.0168 CEST] [warn] Asset iso/SLE-12-SP4-Server-DVD-aarch64-Build0395-Media1.iso is not in any job group, will delete in 1.99927083333333 days
[2018-09-24T16:36:00.0991 CEST] [warn] Asset iso/SLE-12-SP4-Server-DVD-aarch64-Build0395-Media1.iso is not in any job group, will delete in 1.99423611111111 days
[2018-09-24T16:37:20.0982 CEST] [info] Removing asset iso/SLE-12-SP4-Server-DVD-aarch64-Build0395-Media1.iso (assigned to groups: 151, 155, 158, 169, 132)
[2018-09-24T16:37:21.0008 CEST] [info] GRU: removing /var/lib/openqa/share/factory/iso/SLE-12-SP4-Server-DVD-aarch64-Build0395-Media1.iso

#7 Updated by coolo over 1 year ago

openqa=> select size_limit_gb,exclusively_kept_asset_size/1024/1024/1024 from job_groups where id in (151, 155, 158, 169, 132);
 size_limit_gb | ?column? 
---------------+----------
            30 |       29
            30 |       29
            30 |       29
            30 |       29
            40 |       39

#8 Updated by coolo over 1 year ago

So the size limit of the groups isn't the problem either. The picked_into algorithm regressed

#9 Updated by coolo over 1 year ago

I added some more debug to production - and while build395 is gone, this is how it looks like currently:

                groups      => {
                                 132 => 2084286,
                                 139 => 2068130,
                                 142 => 2068075,
                                 151 => 2075007,
                                 155 => 2073132,
                                 158 => 2073533,
                                 169 => 2073683,
                                 189 => 2089744,
                               },
                id          => 299793,
                max_job     => 2089744,
                name        => "iso/SLE-12-SP4-Server-DVD-aarch64-Build0393-Media1.iso",
                pending     => 0,
                picked_into => 139,


                groups      => {
                                 132 => 2054116,
                                 139 => 2054738,
                                 142 => 2054121,
                                 151 => 2061413,
                                 155 => 2054111,
                                 158 => 2054087,
                                 169 => 2052488,
                                 189 => 2074925,
                               },
                id          => 293458,
                max_job     => 2074925,
                name        => "iso/SLE-12-SP4-Server-DVD-aarch64-Build0387-Media1.iso",
                pending     => 0,
                picked_into => 189,


                groups      => {
                                 132 => 2060328,
                                 139 => 2062023,
                                 142 => 2060371,
                                 151 => 2061938,
                                 155 => 2061941,
                                 158 => 2062031,
                                 169 => 2061918,
                               },
                id          => 297063,
                max_job     => 2062031,
                name        => "iso/SLE-12-SP4-Server-DVD-aarch64-Build0391-Media1.iso",
                pending     => 1,
                picked_into => 169,

                groups      => {
                                 132 => 2031928,
                                 139 => 2033935,
                                 142 => 2031972,
                                 151 => 2036324,
                                 155 => 2032012,
                                 158 => 2039164,
                                 169 => 2045759,
                                 188 => 2042337,
                                 189 => 2059680,
                               },
                id          => 289361,
                max_job     => 2059680,
                name        => "iso/SLE-12-SP4-Server-DVD-aarch64-Build0378-Media1.iso",
                pending     => 0,
                picked_into => 188,


                groups      => {
                                 132 => 2031747,
                                 139 => 2031748,
                                 142 => 2029312,
                                 151 => 2030963,
                                 155 => 2031788,
                                 158 => 2034271,
                                 169 => 2030630,
                                 188 => 2031557,
                                 189 => 2036358,
                               },
                id          => 287540,
                max_job     => 2036358,
                name        => "iso/SLE-12-SP4-Server-DVD-aarch64-Build0377-Media1.iso",
                pending     => 0,
                picked_into => 188,


                groups      => {
                                 139 => 2023596,
                                 142 => 2023598,
                                 151 => 2025813,
                                 155 => 2027085,
                                 158 => 2027800,
                                 169 => 2029112,
                                 189 => 2028517,
                               },
                id          => 285693,
                max_job     => 2029112,
                name        => "iso/SLE-12-SP4-Server-DVD-aarch64-Build0374-Media1.iso",
                pending     => 0,
                picked_into => 189,

                groups      => {
                                 139 => 2001197,
                                 142 => 2001198,
                                 151 => 2005127,
                                 155 => 2027247,
                                 158 => 2027799,
                                 169 => 2001253,
                                 189 => 2019737,
                               },
                id          => 275764,
                max_job     => 2027799,
                name        => "iso/SLE-12-SP4-Server-DVD-aarch64-Build0365-Media1.iso",
                pending     => 0,
                picked_into => 189,

This looks all insane (keeping 7 builds while removing the latest), but the only suprising part is the one pending=1

#10 Updated by coolo over 1 year ago

The strange priority on picking what to delete is obvious on resync:

openqa:/home/coolo # grep -C1 SLES-12-SP4-x86_64-03 /srv/log/openqa_gru     
                max_job     => 2089602,
                name        => "hdd/SLES-12-SP4-x86_64-0395\@64bit-minimal_with_sdk0311_installed_withhome.qcow2",
                pending     => 1,
--
                max_job     => 2087256,
                name        => "hdd/SLES-12-SP4-x86_64-0395\@64bit-pcm_googlecloud.qcow2",
                pending     => 0,
--
                max_job     => 2087245,
                name        => "hdd/SLES-12-SP4-x86_64-0395\@64bit-pcm_aws.qcow2",
                pending     => 0,
--
                max_job     => 2087188,
                name        => "hdd/SLES-12-SP4-x86_64-0395\@64bit-pcm_azure.qcow2",
                pending     => 0,
--
                max_job     => 2087185,
                name        => "hdd/SLES-12-SP4-x86_64-0395\@64bit-minimal_with_sdk0311_installed.qcow2",
                pending     => 0,
--
                max_job     => 2087129,
                name        => "hdd/SLES-12-SP4-x86_64-0395\@64bit-minimal_with_sdk0311_pcm_installed.qcow2",
                pending     => 0,
--
                max_job     => 2073650,
                name        => "hdd/SLES-12-SP4-x86_64-0393\@64bit-pcm_aws.qcow2",
                pending     => 0,
--
                max_job     => 2073649,
                name        => "hdd/SLES-12-SP4-x86_64-0393\@64bit-pcm_googlecloud.qcow2",
                pending     => 0,
--
                max_job     => 2073020,
                name        => "hdd/SLES-12-SP4-x86_64-0393\@64bit-minimal_with_sdk0309_installed.qcow2",
                pending     => 0,
--
                max_job     => 2068838,
                name        => "hdd/SLES-12-SP4-x86_64-0393\@64bit-minimal_with_sdk0309_installed_withhome.qcow2",
                pending     => 0,
--
                max_job     => 2061191,
                name        => "hdd/SLES-12-SP4-x86_64-0391\@64bit-minimal_with_sdk0307_installed.qcow2",
                pending     => 0,
--
                max_job     => 2053519,
                name        => "hdd/SLES-12-SP4-x86_64-0387\@64bit-minimal_with_sdk0304_installed.qcow2",
                pending     => 0,
--
                max_job     => 2037870,
                name        => "hdd/SLES-12-SP4-x86_64-0378\@64bit-minimal_with_sdk0298_installed.qcow2",
                pending     => 0,
--
                max_job     => 2031324,
                name        => "hdd/SLES-12-SP4-x86_64-0377\@64bit-minimal_with_sdk0297_installed.qcow2",
                pending     => 0,
--
                max_job     => 2000997,
                name        => "hdd/SLES-12-SP4-x86_64-0352\@64bit-minimal_with_sdk0280_installed.qcow2",
                pending     => 1,
--
                max_job     => 2089602,
                name        => "hdd/SLES-12-SP4-x86_64-0395\@64bit-minimal_with_sdk0311_installed_withhome.qcow2",
                pending     => 1,
--
                max_job     => 2087256,
                name        => "hdd/SLES-12-SP4-x86_64-0395\@64bit-pcm_googlecloud.qcow2",
                pending     => 0,
--
                max_job     => 2087245,
                name        => "hdd/SLES-12-SP4-x86_64-0395\@64bit-pcm_aws.qcow2",
                pending     => 0,
--
                max_job     => 2087188,
                name        => "hdd/SLES-12-SP4-x86_64-0395\@64bit-pcm_azure.qcow2",
                pending     => 0,
--
                max_job     => 2087185,
                name        => "hdd/SLES-12-SP4-x86_64-0395\@64bit-minimal_with_sdk0311_installed.qcow2",
                pending     => 0,
--
                max_job     => 2087129,
                name        => "hdd/SLES-12-SP4-x86_64-0395\@64bit-minimal_with_sdk0311_pcm_installed.qcow2",
                pending     => 0,
--
                max_job     => 2073650,
                name        => "hdd/SLES-12-SP4-x86_64-0393\@64bit-pcm_aws.qcow2",
                pending     => 0,
--
                max_job     => 2073649,
                name        => "hdd/SLES-12-SP4-x86_64-0393\@64bit-pcm_googlecloud.qcow2",
                pending     => 0,
--
                max_job     => 2073020,
                name        => "hdd/SLES-12-SP4-x86_64-0393\@64bit-minimal_with_sdk0309_installed.qcow2",
                pending     => 0,
--
                max_job     => 2068838,
                name        => "hdd/SLES-12-SP4-x86_64-0393\@64bit-minimal_with_sdk0309_installed_withhome.qcow2",
                pending     => 0,
--
                max_job     => 2061191,
                name        => "hdd/SLES-12-SP4-x86_64-0391\@64bit-minimal_with_sdk0307_installed.qcow2",
                pending     => 0,
--
                max_job     => 2053519,
                name        => "hdd/SLES-12-SP4-x86_64-0387\@64bit-minimal_with_sdk0304_installed.qcow2",
                pending     => 0,
--
                max_job     => 2037870,
                name        => "hdd/SLES-12-SP4-x86_64-0378\@64bit-minimal_with_sdk0298_installed.qcow2",
                pending     => 0,
--
                max_job     => 2031324,
                name        => "hdd/SLES-12-SP4-x86_64-0377\@64bit-minimal_with_sdk0297_installed.qcow2",
                pending     => 0,
--
                max_job     => 2000997,
                name        => "hdd/SLES-12-SP4-x86_64-0352\@64bit-minimal_with_sdk0280_installed.qcow2",
                pending     => 1,
--
[2018-09-25T09:09:40.0471 CEST] [info] GRU: removing /var/lib/openqa/share/factory/hdd/SLES-12-SP3-x86_64-mru-install-minimal-with-addons-Build:8837:ghostscript.1537779325.qcow2
[2018-09-25T09:09:40.0509 CEST] [info] Removing asset hdd/SLES-12-SP4-x86_64-0393@64bit-pcm_aws.qcow2 (assigned to groups: 139)
[2018-09-25T09:09:40.0517 CEST] [info] GRU: removing /var/lib/openqa/share/factory/hdd/SLES-12-SP4-x86_64-0393@64bit-pcm_aws.qcow2
[2018-09-25T09:09:40.0560 CEST] [info] Removing asset hdd/SLES-12-SP4-x86_64-0393@64bit-pcm_googlecloud.qcow2 (assigned to groups: 139)
[2018-09-25T09:09:40.0573 CEST] [info] GRU: removing /var/lib/openqa/share/factory/hdd/SLES-12-SP4-x86_64-0393@64bit-pcm_googlecloud.qcow2
[2018-09-25T09:09:40.0596 CEST] [info] Removing asset repo/SLE-12-SP4-SERVER-POOL-ppc64le-Build0391-Media3 (assigned to groups: 155, 158, 151, 139, 157, 184, 132, 142)

#11 Updated by coolo over 1 year ago

So one bug: 'pending' is calculated depending on 't_finished is null' - but that's true for skipped jobs too.

select distinct state,result from jobs where t_finished is null;

     state      |       result       
----------------+--------------------
 running        | none
 user_cancelled | parallel_restarted
 scheduled      | none
 skipped        | parallel_restarted
 user_cancelled | parallel_failed
 uploading      | none

#12 Updated by coolo over 1 year ago

  • Related to action #41543: parallel_failed jobs miss t_finished added

#13 Updated by coolo over 1 year ago

  • Target version set to Current Sprint
  • Difficulty set to hard

Now that the pending calculcation is fixed, some old assets are removed and freeing for more important builds. Now awaiting the next full build to see if it's handled correctly

#14 Updated by coolo over 1 year ago

  • Priority changed from Urgent to High

There was one bug in the calculation of the pending state - but it doesn't explain the effect we see. The added debug output hopefully helps understanding the problem if it hits again

#15 Updated by coolo over 1 year ago

So what's noteworthy - when the asset was removed, it only had "151, 155, 158, 169, 132" assigned

Now it has additionally 139 and 142 - which are also the much larger groups.

#16 Updated by coolo over 1 year ago

We have no useful data on what could happen. This is just how it's supposed to be.

After sync, we see it as this:
[warn] Asset iso/SLE-12-SP4-Server-DVD-x86_64-Build0396-Media1.iso is not in any job group, will delete in 1.99846064814815 days

Then we schedule it and see it as:
[debug] Asset iso/SLE-12-SP4-Server-DVD-x86_64-Build0396-Media1.iso (4011851776) picked into 139

                groups      => {},
                id          => 319138,
                max_job     => undef,
                name        => "iso/SLE-12-SP4-Server-DVD-x86_64-Build0396-Media1.iso",
                pending     => 0,
                picked_into => 0,
                size        => 4011851776,
--
                               },
                id          => 319138,
                max_job     => 2095195,
                name        => "iso/SLE-12-SP4-Server-DVD-x86_64-Build0396-Media1.iso",
                pending     => 1,
                picked_into => 139,
                size        => 4011851776,

#17 Updated by coolo over 1 year ago

  • Priority changed from High to Urgent

Now we just had another case and it's quite obvious.

[2018-09-28T15:59:22.0841 CEST] [info] Registering asset iso/SLE-15-SP1-Installer-DVD-ppc64le-Build53.7-Media1.iso

And then it's looked at as very last in the join. But the infos about it are pretty obviously
showing the opposite:

          {
                fixed       => 0,
                groups      => {
                                 96  => 2104857,
                                 110 => 2104851,
                                 111 => 2104787,
                                 112 => 2104811,
                                 116 => 2104859,
                                 129 => 2104853,
                                 143 => 2104746,
                                 146 => 2104824,
                               },
                id          => 326573,
                max_job     => undef,
                name        => "iso/SLE-15-SP1-Installer-DVD-ppc64le-Build53.7-Media1.iso",
                pending     => 0,
                picked_into => 0,
                size        => 624281600,
                t_created   => "2018-09-28 13:59:22",
                type        => "iso",
              },

So we have no max_job at all, but when we look for groups, we see a lot of them. Which means when the initial
join through the assets was done, the asset wasn't yet scheduled - but when the query per group was done,
we got some.

But as the assets are looked in the initial order, it no longer fits - but as it has groups, it's deleted.

#19 Updated by nicksinger over 1 year ago

  • Subject changed from [tools] medium that should belong to job group was deleted after just 2 minutes making our SLE15 tests in build 47.1 useless to [tools] medium that should belong to job group was deleted after just 2 minutes making our SLE15 tests in build 47.1 useless ... and more

#21 Updated by coolo over 1 year ago

  • Status changed from New to Resolved

We're confident that this will fix the issue - and so far we didn't see the problem since hotdeploying the first fix

#22 Updated by coolo over 1 year ago

  • Target version changed from Current Sprint to Done

#23 Updated by mkittler about 1 year ago

  • Related to action #45041: Asset cache cleanup still fails if new jobs are created at the same time added

Also available in: Atom PDF