action #41483
closed[tools] medium that should belong to job group was deleted after just 2 minutes making our SLE15 tests in build 47.1 useless ... and more
Added by okurz about 6 years ago. Updated about 6 years ago.
Description
Observation
/var/log/openqa.1.xz:[2018-09-23T16:49:13.0743 CEST] [info] Registering asset iso/SLE-15-SP1-Installer-DVD-x86_64-Build47.1-Media1.iso
/var/log/openqa.1.xz:[2018-09-23T16:51:50.0297 CEST] [warn] Asset iso/SLE-15-SP1-Installer-DVD-x86_64-Build47.1-Media1.iso is not in any job group, will delete in 2 days
/var/log/openqa.1.xz:[2018-09-23T16:52:26.0339 CEST] [info] GRU: removing /var/lib/openqa/share/factory/iso/SLE-15-SP1-Installer-DVD-x86_64-Build47.1-Media1.iso
-> [24 Sep 2018 09:07:38] <coolo> 2 days became minutes quickly :)
Problem
This looks like it was introduced by recent changes to track non-job-group assets better.
Updated by okurz about 6 years ago
It seems there was a deployment on 2018-09-23 bringing in the recent changes for better handling of assets which are not in any group?
Updated by coolo about 6 years ago
- Assignee set to coolo
I don't think that the 'groupless' tracking is faulty here - that part said 'in 2 days'. It rather looks to me like the ISO was accounted to a very tiny group in the 2nd round - the ISO wasn't really done scheduling :(
But we lack that information.
Updated by mkittler about 6 years ago
- Assignee deleted (coolo)
Asset iso/SLE-15-SP1-Installer-DVD-x86_64-Build47.1-Media1.iso is not in any job group, will delete in 2 days
Apparently there really was no group assigned at this point, or the code for untracked assets is indeed faulty. I've just looked at the code again but can't see any obvious mistake.
GRU: removing /var/lib/openqa/share/factory/iso/SLE-15-SP1-Installer-DVD-x86_64-Build47.1-Media1.iso
We don't know why the asset has been removed. I suspect it wasn't group-less anymore at this point so the 2 days didn't apply anymore. Like coolo said it could have been accounted to a very tiny group.
So we should likely log why an asset has been deleted and maybe it makes sense not to delete anything while scheduling an ISO.
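To make incidents like this easier to analyze, it would also help to be able to query directly which job groups an asset is currently associated with. A rough sketch of such a query - assuming the usual openQA schema with a jobs_assets link table and jobs.group_id, so table and column names here are assumptions, not verified against production:
-- sketch: which job groups currently reference this asset?
select jg.id as group_id, jg.name as group_name, count(*) as jobs_in_group
  from assets a
  join jobs_assets ja on ja.asset_id = a.id
  join jobs j on j.id = ja.job_id
  left join job_groups jg on jg.id = j.group_id
 where a.name = 'iso/SLE-15-SP1-Installer-DVD-x86_64-Build47.1-Media1.iso'
 group by jg.id, jg.name
 order by jg.id;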
Updated by coolo about 6 years ago
We now have more debug output - and as expected: the 'no group' handling is not the problem:
[2018-09-24T16:25:51.0466 CEST] [info] Registering asset iso/SLE-12-SP4-Server-DVD-aarch64-Build0395-Media1.iso
[2018-09-24T16:28:56.0168 CEST] [warn] Asset iso/SLE-12-SP4-Server-DVD-aarch64-Build0395-Media1.iso is not in any job group, will delete in 1.99927083333333 days
[2018-09-24T16:36:00.0991 CEST] [warn] Asset iso/SLE-12-SP4-Server-DVD-aarch64-Build0395-Media1.iso is not in any job group, will delete in 1.99423611111111 days
[2018-09-24T16:37:20.0982 CEST] [info] Removing asset iso/SLE-12-SP4-Server-DVD-aarch64-Build0395-Media1.iso (assigned to groups: 151, 155, 158, 169, 132)
[2018-09-24T16:37:21.0008 CEST] [info] GRU: removing /var/lib/openqa/share/factory/iso/SLE-12-SP4-Server-DVD-aarch64-Build0395-Media1.iso
Updated by coolo about 6 years ago
openqa=> select size_limit_gb,exclusively_kept_asset_size/1024/1024/1024 from job_groups where id in (151, 155, 158, 169, 132);
size_limit_gb | ?column?
---------------+----------
30 | 29
30 | 29
30 | 29
30 | 29
40 | 39
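For reference: size_limit_gb is already in GiB, while exclusively_kept_asset_size is apparently tracked in bytes, hence the division by 1024^3 above. The same query could be extended - just a sketch over the columns already shown - to report the remaining headroom per group directly:
-- sketch: remaining headroom per group in GiB
select id, size_limit_gb,
       exclusively_kept_asset_size/1024/1024/1024 as kept_gb,
       size_limit_gb - exclusively_kept_asset_size/1024/1024/1024 as headroom_gb
  from job_groups
 where id in (151, 155, 158, 169, 132);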
Updated by coolo about 6 years ago
So the size limit of the groups isn't the problem either. The picked_into algorithm regressed
Updated by coolo about 6 years ago
I added some more debug output to production - and while build395 is gone, this is what it looks like currently:
groups => {
132 => 2084286,
139 => 2068130,
142 => 2068075,
151 => 2075007,
155 => 2073132,
158 => 2073533,
169 => 2073683,
189 => 2089744,
},
id => 299793,
max_job => 2089744,
name => "iso/SLE-12-SP4-Server-DVD-aarch64-Build0393-Media1.iso",
pending => 0,
picked_into => 139,
groups => {
132 => 2054116,
139 => 2054738,
142 => 2054121,
151 => 2061413,
155 => 2054111,
158 => 2054087,
169 => 2052488,
189 => 2074925,
},
id => 293458,
max_job => 2074925,
name => "iso/SLE-12-SP4-Server-DVD-aarch64-Build0387-Media1.iso",
pending => 0,
picked_into => 189,
groups => {
132 => 2060328,
139 => 2062023,
142 => 2060371,
151 => 2061938,
155 => 2061941,
158 => 2062031,
169 => 2061918,
},
id => 297063,
max_job => 2062031,
name => "iso/SLE-12-SP4-Server-DVD-aarch64-Build0391-Media1.iso",
pending => 1,
picked_into => 169,
groups => {
132 => 2031928,
139 => 2033935,
142 => 2031972,
151 => 2036324,
155 => 2032012,
158 => 2039164,
169 => 2045759,
188 => 2042337,
189 => 2059680,
},
id => 289361,
max_job => 2059680,
name => "iso/SLE-12-SP4-Server-DVD-aarch64-Build0378-Media1.iso",
pending => 0,
picked_into => 188,
groups => {
132 => 2031747,
139 => 2031748,
142 => 2029312,
151 => 2030963,
155 => 2031788,
158 => 2034271,
169 => 2030630,
188 => 2031557,
189 => 2036358,
},
id => 287540,
max_job => 2036358,
name => "iso/SLE-12-SP4-Server-DVD-aarch64-Build0377-Media1.iso",
pending => 0,
picked_into => 188,
groups => {
139 => 2023596,
142 => 2023598,
151 => 2025813,
155 => 2027085,
158 => 2027800,
169 => 2029112,
189 => 2028517,
},
id => 285693,
max_job => 2029112,
name => "iso/SLE-12-SP4-Server-DVD-aarch64-Build0374-Media1.iso",
pending => 0,
picked_into => 189,
groups => {
139 => 2001197,
142 => 2001198,
151 => 2005127,
155 => 2027247,
158 => 2027799,
169 => 2001253,
189 => 2019737,
},
id => 275764,
max_job => 2027799,
name => "iso/SLE-12-SP4-Server-DVD-aarch64-Build0365-Media1.iso",
pending => 0,
picked_into => 189,
This all looks insane (keeping 7 builds while removing the latest), but the only surprising part is the one pending=1.
Updated by coolo about 6 years ago
The strange prioritization when picking what to delete is obvious on resync:
openqa:/home/coolo # grep -C1 SLES-12-SP4-x86_64-03 /srv/log/openqa_gru
max_job => 2089602,
name => "hdd/SLES-12-SP4-x86_64-0395\@64bit-minimal_with_sdk0311_installed_withhome.qcow2",
pending => 1,
--
max_job => 2087256,
name => "hdd/SLES-12-SP4-x86_64-0395\@64bit-pcm_googlecloud.qcow2",
pending => 0,
--
max_job => 2087245,
name => "hdd/SLES-12-SP4-x86_64-0395\@64bit-pcm_aws.qcow2",
pending => 0,
--
max_job => 2087188,
name => "hdd/SLES-12-SP4-x86_64-0395\@64bit-pcm_azure.qcow2",
pending => 0,
--
max_job => 2087185,
name => "hdd/SLES-12-SP4-x86_64-0395\@64bit-minimal_with_sdk0311_installed.qcow2",
pending => 0,
--
max_job => 2087129,
name => "hdd/SLES-12-SP4-x86_64-0395\@64bit-minimal_with_sdk0311_pcm_installed.qcow2",
pending => 0,
--
max_job => 2073650,
name => "hdd/SLES-12-SP4-x86_64-0393\@64bit-pcm_aws.qcow2",
pending => 0,
--
max_job => 2073649,
name => "hdd/SLES-12-SP4-x86_64-0393\@64bit-pcm_googlecloud.qcow2",
pending => 0,
--
max_job => 2073020,
name => "hdd/SLES-12-SP4-x86_64-0393\@64bit-minimal_with_sdk0309_installed.qcow2",
pending => 0,
--
max_job => 2068838,
name => "hdd/SLES-12-SP4-x86_64-0393\@64bit-minimal_with_sdk0309_installed_withhome.qcow2",
pending => 0,
--
max_job => 2061191,
name => "hdd/SLES-12-SP4-x86_64-0391\@64bit-minimal_with_sdk0307_installed.qcow2",
pending => 0,
--
max_job => 2053519,
name => "hdd/SLES-12-SP4-x86_64-0387\@64bit-minimal_with_sdk0304_installed.qcow2",
pending => 0,
--
max_job => 2037870,
name => "hdd/SLES-12-SP4-x86_64-0378\@64bit-minimal_with_sdk0298_installed.qcow2",
pending => 0,
--
max_job => 2031324,
name => "hdd/SLES-12-SP4-x86_64-0377\@64bit-minimal_with_sdk0297_installed.qcow2",
pending => 0,
--
max_job => 2000997,
name => "hdd/SLES-12-SP4-x86_64-0352\@64bit-minimal_with_sdk0280_installed.qcow2",
pending => 1,
--
max_job => 2089602,
name => "hdd/SLES-12-SP4-x86_64-0395\@64bit-minimal_with_sdk0311_installed_withhome.qcow2",
pending => 1,
--
max_job => 2087256,
name => "hdd/SLES-12-SP4-x86_64-0395\@64bit-pcm_googlecloud.qcow2",
pending => 0,
--
max_job => 2087245,
name => "hdd/SLES-12-SP4-x86_64-0395\@64bit-pcm_aws.qcow2",
pending => 0,
--
max_job => 2087188,
name => "hdd/SLES-12-SP4-x86_64-0395\@64bit-pcm_azure.qcow2",
pending => 0,
--
max_job => 2087185,
name => "hdd/SLES-12-SP4-x86_64-0395\@64bit-minimal_with_sdk0311_installed.qcow2",
pending => 0,
--
max_job => 2087129,
name => "hdd/SLES-12-SP4-x86_64-0395\@64bit-minimal_with_sdk0311_pcm_installed.qcow2",
pending => 0,
--
max_job => 2073650,
name => "hdd/SLES-12-SP4-x86_64-0393\@64bit-pcm_aws.qcow2",
pending => 0,
--
max_job => 2073649,
name => "hdd/SLES-12-SP4-x86_64-0393\@64bit-pcm_googlecloud.qcow2",
pending => 0,
--
max_job => 2073020,
name => "hdd/SLES-12-SP4-x86_64-0393\@64bit-minimal_with_sdk0309_installed.qcow2",
pending => 0,
--
max_job => 2068838,
name => "hdd/SLES-12-SP4-x86_64-0393\@64bit-minimal_with_sdk0309_installed_withhome.qcow2",
pending => 0,
--
max_job => 2061191,
name => "hdd/SLES-12-SP4-x86_64-0391\@64bit-minimal_with_sdk0307_installed.qcow2",
pending => 0,
--
max_job => 2053519,
name => "hdd/SLES-12-SP4-x86_64-0387\@64bit-minimal_with_sdk0304_installed.qcow2",
pending => 0,
--
max_job => 2037870,
name => "hdd/SLES-12-SP4-x86_64-0378\@64bit-minimal_with_sdk0298_installed.qcow2",
pending => 0,
--
max_job => 2031324,
name => "hdd/SLES-12-SP4-x86_64-0377\@64bit-minimal_with_sdk0297_installed.qcow2",
pending => 0,
--
max_job => 2000997,
name => "hdd/SLES-12-SP4-x86_64-0352\@64bit-minimal_with_sdk0280_installed.qcow2",
pending => 1,
--
[2018-09-25T09:09:40.0471 CEST] [info] GRU: removing /var/lib/openqa/share/factory/hdd/SLES-12-SP3-x86_64-mru-install-minimal-with-addons-Build:8837:ghostscript.1537779325.qcow2
[2018-09-25T09:09:40.0509 CEST] [info] Removing asset hdd/SLES-12-SP4-x86_64-0393@64bit-pcm_aws.qcow2 (assigned to groups: 139)
[2018-09-25T09:09:40.0517 CEST] [info] GRU: removing /var/lib/openqa/share/factory/hdd/SLES-12-SP4-x86_64-0393@64bit-pcm_aws.qcow2
[2018-09-25T09:09:40.0560 CEST] [info] Removing asset hdd/SLES-12-SP4-x86_64-0393@64bit-pcm_googlecloud.qcow2 (assigned to groups: 139)
[2018-09-25T09:09:40.0573 CEST] [info] GRU: removing /var/lib/openqa/share/factory/hdd/SLES-12-SP4-x86_64-0393@64bit-pcm_googlecloud.qcow2
[2018-09-25T09:09:40.0596 CEST] [info] Removing asset repo/SLE-12-SP4-SERVER-POOL-ppc64le-Build0391-Media3 (assigned to groups: 155, 158, 151, 139, 157, 184, 132, 142)
Updated by coolo about 6 years ago
So one bug: 'pending' is calculated depending on 't_finished is null' - but that's true for skipped jobs too.
select distinct state,result from jobs where t_finished is null;
state | result
----------------+--------------------
running | none
user_cancelled | parallel_restarted
scheduled | none
skipped | parallel_restarted
user_cancelled | parallel_failed
uploading | none
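A sketch of how the pending check could be tightened, based only on the states visible above: instead of treating every job with t_finished IS NULL as pending, only count jobs that are genuinely still waiting or executing. The exact state list in openQA may differ, so treat this as an illustration rather than the actual fix:
-- sketch: jobs that 't_finished is null' treats as pending
-- although they are not actually waiting or executing
select state, result, count(*)
  from jobs
 where t_finished is null
   and state not in ('scheduled', 'running', 'uploading')
 group by state, result;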
Updated by coolo about 6 years ago
- Related to action #41543: parallel_failed jobs miss t_finished added
Updated by coolo about 6 years ago
- Target version set to Current Sprint
- Difficulty set to hard
Now that the pending calculation is fixed, some old assets are removed, freeing space for more important builds. Now awaiting the next full build to see if it's handled correctly.
Updated by coolo about 6 years ago
- Priority changed from Urgent to High
There was one bug in the calculation of the pending state - but it doesn't explain the effect we see. The added debug output hopefully helps with understanding the problem if it hits again.
Updated by coolo about 6 years ago
So what's noteworthy - when the asset was removed, it only had "151, 155, 158, 169, 132" assigned. Now it additionally has 139 and 142 - which are also the much larger groups.
Updated by coolo about 6 years ago
So far we have no useful data on what could be going wrong - what follows is just how it's supposed to be.
After sync, we see it like this:
[warn] Asset iso/SLE-12-SP4-Server-DVD-x86_64-Build0396-Media1.iso is not in any job group, will delete in 1.99846064814815 days
Then we schedule it and see it as:
[debug] Asset iso/SLE-12-SP4-Server-DVD-x86_64-Build0396-Media1.iso (4011851776) picked into 139
groups => {},
id => 319138,
max_job => undef,
name => "iso/SLE-12-SP4-Server-DVD-x86_64-Build0396-Media1.iso",
pending => 0,
picked_into => 0,
size => 4011851776,
--
},
id => 319138,
max_job => 2095195,
name => "iso/SLE-12-SP4-Server-DVD-x86_64-Build0396-Media1.iso",
pending => 1,
picked_into => 139,
size => 4011851776,
Updated by coolo about 6 years ago
- Priority changed from High to Urgent
Now we just had another case and it's quite obvious.
[2018-09-28T15:59:22.0841 CEST] [info] Registering asset iso/SLE-15-SP1-Installer-DVD-ppc64le-Build53.7-Media1.iso
And then it's looked at as the very last one in the join. But the info about it pretty obviously shows the opposite:
{
fixed => 0,
groups => {
96 => 2104857,
110 => 2104851,
111 => 2104787,
112 => 2104811,
116 => 2104859,
129 => 2104853,
143 => 2104746,
146 => 2104824,
},
id => 326573,
max_job => undef,
name => "iso/SLE-15-SP1-Installer-DVD-ppc64le-Build53.7-Media1.iso",
pending => 0,
picked_into => 0,
size => 624281600,
t_created => "2018-09-28 13:59:22",
type => "iso",
},
So we have no max_job at all, but when we look for groups, we see a lot of them. That means that when the initial join over the assets was done, the asset wasn't scheduled yet - but by the time the per-group query ran, it was.
But since the assets are processed in that initial order, the asset is considered last and no longer fits - and because it now has groups, it gets deleted.
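One way to guard against this race - in line with mkittler's earlier suggestion not to delete anything while an ISO is being scheduled - would be to re-check, immediately before removing an asset, whether it still has unfinished jobs attached, instead of trusting the snapshot taken at the start of the cleanup run. A sketch only, again assuming a jobs_assets link table and the state names seen in the earlier query:
-- sketch: re-check right before deletion; if this returns > 0, skip the removal
select count(*) as unfinished_jobs
  from jobs_assets ja
  join jobs j on j.id = ja.job_id
 where ja.asset_id = 326573
   and j.state in ('scheduled', 'running', 'uploading');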
Updated by nicksinger about 6 years ago
- Subject changed from [tools] medium that should belong to job group was deleted after just 2 minutes making our SLE15 tests in build 47.1 useless to [tools] medium that should belong to job group was deleted after just 2 minutes making our SLE15 tests in build 47.1 useless ... and more
Updated by mkittler about 6 years ago
Updated by coolo about 6 years ago
- Status changed from New to Resolved
We're confident that this will fix the issue - and so far we haven't seen the problem since hot-deploying the first fix.
Updated by coolo about 6 years ago
- Target version changed from Current Sprint to Done
Updated by mkittler about 6 years ago
- Related to action #45041: Asset cache cleanup still fails if new jobs are created at the same time added