action #19672

GRU may delete assets while jobs are registered

Added by dimstar over 2 years ago. Updated about 2 years ago.

Status:ResolvedStart date:08/06/2017
Priority:NormalDue date:
Assignee:coolo% Done:

0%

Category:Concrete Bugs
Target version:Done
Difficulty:
Duration:

Description

Sample job: https://openqa.opensuse.org/tests/416444

This job started right at the moment a new snapshot was sent to openQA - together with many others, it was 'incomplete' right away and apparently obsoleted by another job

The factory dahsboard currently shows: passed: 78 softfailed: 19 failed: 14 obsoleted: 53 incomplete: 7

and totest manager es equally confused by this, and it generally takes a couple retriggers of jobs that don't even appear in the openQA overview of a snapshot, making the entire process quite cumbersome now.

This has been observed for the last couple snapshots, roughly about a week ago the first time


Related issues

Related to openQA Project - action #28328: job was triggered trying to download HDD image but it's a... Rejected 24/11/2017
Related to openQA Project - action #45041: Asset cache cleanup still fails if new jobs are created a... Resolved 12/12/2018

History

#1 Updated by coolo over 2 years ago

  • Assignee set to coolo

I will look into it - the deployment is pretty old, possibly this is already fixed

#2 Updated by mlin7442 over 2 years ago

for example,

https://openqa.opensuse.org/api/v1/jobs/418447 <= clone_id:null, result:imcomplete
https://openqa.opensuse.org/api/v1/jobs/418588

as 418447 was not an "obsoleted" job then ttm thinks it is valid.

#3 Updated by coolo over 2 years ago

It was clone_id:null until dimstar retriggered it :(

#4 Updated by coolo over 2 years ago

this looks like a problem with the way the ISO is setup - I'm trying with a SQL dump of o3

#5 Updated by coolo over 2 years ago

09:46:43 POST /api/v1/isos created 418447
09:52:06 POST /api/v1/isos created 418588

This maps to

[10/Jun/2017:09:46:42 +0000] "POST /api/v1/isos?
ISO=openSUSE-Tumbleweed-DVD-x86_64-Snapshot20170609-Media.iso
&REPO_0_DEBUGINFO=openSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609-debuginfo
&BUILD=20170609
&VERSION=Tumbleweed
&SUSEMIRROR=http%3A%2F%2Fopenqa.opensuse.org%2Fassets%2Frepo%2FopenSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609
&FLAVOR=DVD
&FULLURL=1
&MIRROR_HTTP=http%3A%2F%2Fopenqa.opensuse.org%2Fassets%2Frepo%2FopenSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609
&ARCH=x86_64
&DISTRI=opensuse
&REPO_0=openSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609 HTTP/1.1" 200 1018 "-" "Mojolicious (Perl)"

and

"POST /api/v1/isos?
SUSEMIRROR=http%3A%2F%2Fopenqa.opensuse.org%2Fassets%2Frepo%2FopenSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609
&VERSION=Tumbleweed
&REPO_0_DEBUGINFO=openSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609-debuginfo
&ARCH=x86_64
&REPO_0=openSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609
&FLAVOR=DVD
&BUILD=20170609
&DISTRI=opensuse
&MIRROR_HTTP=http%3A%2F%2Fopenqa.opensuse.org%2Fassets%2Frepo%2FopenSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609
&FULLURL=1
&ISO=openSUSE-Tumbleweed-DVD-x86_64-Snapshot20170609-Media.iso HTTP/1.1" 200 1018 "-" "Mojolicious (Perl)"

Looks like dups - unless I missed something.

The rsync.log is a bit sparse, but for completeness:

rsync start at Sat Jun 10 09:35:01 UTC 2017
settings for local/_product:openSUSE-Kubic-dvd5-DVD-x86_64/openSUSE-Kubic-DVD-x86_64-Snapshot20170609-Media.iso is neither hash nor sub!
openSUSE-Tumbleweed-DVD-s390x-Snapshot20170601-Media.iso: repo openSUSE-Tumbleweed-oss-s390x-Snapshot20170601 missing, skipped
++ WARN: from_733: broken byte order
++ WARN: from_733: broken byte order
kernel-default-4.11.3-1.5.nosrc.rpm -> kernel-source-4.11.3-1.5.src.rpm
kernel-default-4.11.3-1.5.nosrc.rpm -> kernel-source-4.11.3-1.5.src.rpm
kernel-syms-4.11.3-1.5.src.rpm -> kernel-source-4.11.3-1.5.src.rpm
++ WARN: from_733: broken byte order
++ WARN: from_733: broken byte order
kernel-docs-4.11.3-1.5.src.rpm -> kernel-source-4.11.3-1.5.src.rpm
rsync done at Sat Jun 10 09:46:52 UTC 2017
rsync start at Sat Jun 10 09:50:01 UTC 2017
settings for local/_product:openSUSE-Kubic-dvd5-DVD-x86_64/openSUSE-Kubic-DVD-x86_64-Snapshot20170609-Media.iso is neither hash nor sub!
openSUSE-Tumbleweed-DVD-s390x-Snapshot20170601-Media.iso: repo openSUSE-Tumbleweed-oss-s390x-Snapshot20170601 missing, skipped
++ WARN: from_733: broken byte order
++ WARN: from_733: broken byte order
kernel-default-4.11.3-1.5.nosrc.rpm -> kernel-source-4.11.3-1.5.src.rpm
kernel-default-4.11.3-1.5.nosrc.rpm -> kernel-source-4.11.3-1.5.src.rpm
kernel-syms-4.11.3-1.5.src.rpm -> kernel-source-4.11.3-1.5.src.rpm
++ WARN: from_733: broken byte order
++ WARN: from_733: broken byte order
kernel-docs-4.11.3-1.5.src.rpm -> kernel-source-4.11.3-1.5.src.rpm
rsync done at Sat Jun 10 09:52:16 UTC 2017

To me this looks like a rsync.pl problem - it schedules the DVD iso before it's synced. And then again when it's there

#6 Updated by coolo over 2 years ago

  • Assignee deleted (coolo)

rsync.pl is running with --verbose now - but today's snapshot was scheduled only once

#7 Updated by coolo over 2 years ago

GRU gave the ISO 30 seconds :)

[Mon Jun 19 03:26:30 2017] [3543:debug] [DBIx debug] Took 0.00251794 seconds executed: INSERT INTO assets ( name, t_created, t_updated, type) VALUES ( ?, ?, ?, ? ) RETURNING id: 'openSUSE-Tumbleweed-DVD-x86_64-Snapshot20170618-Media.iso', '2017-06-19 03:26:30', '2017-06-19 03:26:30', 'iso'.
[Mon Jun 19 03:26:56 2017] [19427:debug] Group Development Tumbleweed: iso/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20170618-Media.iso 0KiB->-36GiB
[Mon Jun 19 03:26:57 2017] [19427:info] GRU: removing /var/lib/openqa/share/factory/iso/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20170618-Media.iso

#8 Updated by coolo over 2 years ago

  • Subject changed from openqa schedules jobs before assets are ready to GRU may delete assets while jobs are registered
  • Priority changed from High to Normal

The problem is a bit special, but basically:

we have 2 job groups scheduled against the ISO. When scheduling, we schedule the smaller job group first - jobs are created, but the job group is already below quota at the beginning. GRU will run and find the ISO helping the quota -> removes. This happens basically in parallel to the second job being scheduling. The work around applied is making the other job group large enough to actually hold assets.

This is clearly a bug - if the job group was below quota, the newly added asset can't hardly fix it. But I have no good idea how to properly fix the underlying problem - next to give asset a minimum time to life

#9 Updated by coolo over 2 years ago

  • Category set to Concrete Bugs
  • Target version set to Ready

GRU should give some grace period for assets - entrance level issue

#10 Updated by okurz over 2 years ago

  • Related to action #28328: job was triggered trying to download HDD image but it's already gone added

#11 Updated by AdamWill about 2 years ago

Should we consider https://github.com/os-autoinst/openQA/pull/1518 as fixing this?

#12 Updated by coolo about 2 years ago

  • Status changed from New to Resolved

ok, let's hope for the best

#13 Updated by coolo about 2 years ago

  • Status changed from Resolved to New

23 seconds asset:

06:49 < coolo> [2018-01-25T21:11:22.0972 CET] [info] Registering asset hdd/sle-15-x86_64-423.2-textmode@64bit.qcow2.TEMP-bwBEAWYH
06:49 < coolo> [2018-01-25T21:11:45.0362 CET] [info] GRU: removing /var/lib/openqa/share/factory/hdd/sle-15-x86_64-423.2-textmode@64bit.qcow2

#15 Updated by szarate about 2 years ago

  • Assignee set to coolo
  • Target version changed from Ready to Current Sprint

#16 Updated by coolo about 2 years ago

  • Status changed from New to Resolved

I reworked the GRU algorithm and I sure hope it won't do this anymore. https://github.com/os-autoinst/openQA/pull/1569

#17 Updated by coolo about 2 years ago

meanwhile I think I know what happened. The loop deletes all assets that have size 0 - but the ensure_size is only called at the beginning of the function. So if between ensuring sizes and querying assets to be deleted a new asset is registered, it will gladly be removed as undef == 0 in perl.

My new code does not suffer from that as at the beginning I create a hash from the DB and then only work on this hash instead of doing more queries. The job groups are queried, but if new assets appeared in between they are ignored

#18 Updated by okurz about 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: hpc_hwloc
https://openqa.suse.de/tests/1469528

#19 Updated by szarate about 2 years ago

  • Target version changed from Current Sprint to Done

#20 Updated by mkittler about 1 year ago

  • Related to action #45041: Asset cache cleanup still fails if new jobs are created at the same time added

Also available in: Atom PDF