action #19672
closed
GRU may delete assets while jobs are registered
Description
Sample job: https://openqa.opensuse.org/tests/416444
This job started right at the moment a new snapshot was sent to openQA. Together with many others, it was 'incomplete' right away and was apparently obsoleted by another job.
The factory dashboard currently shows: passed: 78, softfailed: 19, failed: 14, obsoleted: 53, incomplete: 7
The totest manager is equally confused by this, and it generally takes a couple of retriggers of jobs that don't even appear in the openQA overview of a snapshot, making the entire process quite cumbersome.
This has been observed for the last couple of snapshots; the first occurrence was roughly a week ago.
Updated by coolo over 7 years ago
- Assignee set to coolo
I will look into it - the deployment is pretty old, possibly this is already fixed
Updated by mlin7442 over 7 years ago
for example,
https://openqa.opensuse.org/api/v1/jobs/418447 <= clone_id:null, result:incomplete
https://openqa.opensuse.org/api/v1/jobs/418588
As 418447 was not an "obsoleted" job, ttm considers it valid.
Updated by coolo over 7 years ago
It was clone_id:null until dimstar retriggered it :(
Updated by coolo over 7 years ago
This looks like a problem with the way the ISO is set up - I'm trying it with a SQL dump of o3.
Updated by coolo over 7 years ago
09:46:43 POST /api/v1/isos created 418447
09:52:06 POST /api/v1/isos created 418588
This maps to
[10/Jun/2017:09:46:42 +0000] "POST /api/v1/isos?
ISO=openSUSE-Tumbleweed-DVD-x86_64-Snapshot20170609-Media.iso
&REPO_0_DEBUGINFO=openSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609-debuginfo
&BUILD=20170609
&VERSION=Tumbleweed
&SUSEMIRROR=http%3A%2F%2Fopenqa.opensuse.org%2Fassets%2Frepo%2FopenSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609
&FLAVOR=DVD
&FULLURL=1
&MIRROR_HTTP=http%3A%2F%2Fopenqa.opensuse.org%2Fassets%2Frepo%2FopenSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609
&ARCH=x86_64
&DISTRI=opensuse
&REPO_0=openSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609 HTTP/1.1" 200 1018 "-" "Mojolicious (Perl)"
and
"POST /api/v1/isos?
SUSEMIRROR=http%3A%2F%2Fopenqa.opensuse.org%2Fassets%2Frepo%2FopenSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609
&VERSION=Tumbleweed
&REPO_0_DEBUGINFO=openSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609-debuginfo
&ARCH=x86_64
&REPO_0=openSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609
&FLAVOR=DVD
&BUILD=20170609
&DISTRI=opensuse
&MIRROR_HTTP=http%3A%2F%2Fopenqa.opensuse.org%2Fassets%2Frepo%2FopenSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609
&FULLURL=1
&ISO=openSUSE-Tumbleweed-DVD-x86_64-Snapshot20170609-Media.iso HTTP/1.1" 200 1018 "-" "Mojolicious (Perl)"
Looks like dups - unless I missed something.
The rsync.log is a bit sparse, but for completeness:
rsync start at Sat Jun 10 09:35:01 UTC 2017
settings for local/_product:openSUSE-Kubic-dvd5-DVD-x86_64/openSUSE-Kubic-DVD-x86_64-Snapshot20170609-Media.iso is neither hash nor sub!
openSUSE-Tumbleweed-DVD-s390x-Snapshot20170601-Media.iso: repo openSUSE-Tumbleweed-oss-s390x-Snapshot20170601 missing, skipped
++ WARN: from_733: broken byte order
++ WARN: from_733: broken byte order
kernel-default-4.11.3-1.5.nosrc.rpm -> kernel-source-4.11.3-1.5.src.rpm
kernel-default-4.11.3-1.5.nosrc.rpm -> kernel-source-4.11.3-1.5.src.rpm
kernel-syms-4.11.3-1.5.src.rpm -> kernel-source-4.11.3-1.5.src.rpm
++ WARN: from_733: broken byte order
++ WARN: from_733: broken byte order
kernel-docs-4.11.3-1.5.src.rpm -> kernel-source-4.11.3-1.5.src.rpm
rsync done at Sat Jun 10 09:46:52 UTC 2017
rsync start at Sat Jun 10 09:50:01 UTC 2017
settings for local/_product:openSUSE-Kubic-dvd5-DVD-x86_64/openSUSE-Kubic-DVD-x86_64-Snapshot20170609-Media.iso is neither hash nor sub!
openSUSE-Tumbleweed-DVD-s390x-Snapshot20170601-Media.iso: repo openSUSE-Tumbleweed-oss-s390x-Snapshot20170601 missing, skipped
++ WARN: from_733: broken byte order
++ WARN: from_733: broken byte order
kernel-default-4.11.3-1.5.nosrc.rpm -> kernel-source-4.11.3-1.5.src.rpm
kernel-default-4.11.3-1.5.nosrc.rpm -> kernel-source-4.11.3-1.5.src.rpm
kernel-syms-4.11.3-1.5.src.rpm -> kernel-source-4.11.3-1.5.src.rpm
++ WARN: from_733: broken byte order
++ WARN: from_733: broken byte order
kernel-docs-4.11.3-1.5.src.rpm -> kernel-source-4.11.3-1.5.src.rpm
rsync done at Sat Jun 10 09:52:16 UTC 2017
To me this looks like an rsync.pl problem - it schedules the DVD ISO before it's synced, and then again once it's there.
Updated by coolo over 7 years ago
- Assignee deleted (coolo)
rsync.pl is running with --verbose now - but today's snapshot was scheduled only once
Updated by coolo over 7 years ago
GRU gave the ISO 30 seconds :)
[Mon Jun 19 03:26:30 2017] [3543:debug] [DBIx debug] Took 0.00251794 seconds executed: INSERT INTO assets ( name, t_created, t_updated, type) VALUES ( ?, ?, ?, ? ) RETURNING id: 'openSUSE-Tumbleweed-DVD-x86_64-Snapshot20170618-Media.iso', '2017-06-19 03:26:30', '2017-06-19 03:26:30', 'iso'.
[Mon Jun 19 03:26:56 2017] [19427:debug] Group Development Tumbleweed: iso/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20170618-Media.iso 0KiB->-36GiB
[Mon Jun 19 03:26:57 2017] [19427:info] GRU: removing /var/lib/openqa/share/factory/iso/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20170618-Media.iso
Updated by coolo over 7 years ago
- Subject changed from openqa schedules jobs before assets are ready to GRU may delete assets while jobs are registered
- Priority changed from High to Normal
The problem is a bit special, but basically:
we have two job groups scheduled against the ISO. When scheduling, the smaller job group is scheduled first - its jobs are created, but that job group was already below quota at the beginning. GRU then runs, finds that removing the ISO helps the quota, and removes it. This happens basically in parallel with the second job group being scheduled. The workaround applied is making the other job group large enough to actually hold the assets.
This is clearly a bug - if the job group was already below quota, the newly added asset can hardly fix it. But I have no good idea how to properly fix the underlying problem, apart from giving assets a minimum time to live.
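For illustration, a rough sketch of that timeline (hypothetical Perl pseudocode; the function names are made up and this is not the actual openQA scheduler or GRU code):

# t0: the smaller job group is scheduled first against the new ISO
register_asset_if_missing('snapshot.iso');    # asset row exists, size not accounted yet
create_jobs(group => 'small', iso => 'snapshot.iso');

# t1: GRU cleanup runs; the small group was already below quota beforehand,
#     so the fresh ISO ends up as a removal candidate
gru_limit_assets();   # unlinks /var/lib/openqa/share/factory/iso/snapshot.iso

# t2: the second, larger job group is scheduled against the same ISO,
#     but its jobs now reference an asset that is already gone
create_jobs(group => 'large', iso => 'snapshot.iso');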
Updated by coolo about 7 years ago
- Category set to Regressions/Crashes
- Target version set to Ready
GRU should give assets some grace period - entry-level issue
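A minimal sketch of what such a grace period could look like in the cleanup loop (assuming a DateTime-inflated t_created column on the asset rows and made-up helper names, so purely illustrative, not a patch):

# Sketch only: skip assets registered less than two hours ago
use DateTime;

my $grace_cutoff = DateTime->now->subtract(hours => 2);
for my $asset (@removal_candidates) {
    # freshly registered assets get a minimum time to live
    next if $asset->t_created > $grace_cutoff;
    remove_asset($asset);
}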
Updated by okurz about 7 years ago
- Related to action #28328: job was triggered trying to download HDD image but it's already gone added
Updated by AdamWill almost 7 years ago
Should we consider https://github.com/os-autoinst/openQA/pull/1518 as fixing this?
Updated by coolo almost 7 years ago
- Status changed from Resolved to New
An asset that lived for 23 seconds:
06:49 < coolo> [2018-01-25T21:11:22.0972 CET] [info] Registering asset hdd/sle-15-x86_64-423.2-textmode@64bit.qcow2.TEMP-bwBEAWYH
06:49 < coolo> [2018-01-25T21:11:45.0362 CET] [info] GRU: removing /var/lib/openqa/share/factory/hdd/sle-15-x86_64-423.2-textmode@64bit.qcow2
Updated by okurz almost 7 years ago
examples from SLE15 aarch64 build 423.2:
Updated by szarate almost 7 years ago
- Assignee set to coolo
- Target version changed from Ready to Current Sprint
Updated by coolo almost 7 years ago
- Status changed from New to Resolved
I reworked the GRU algorithm and I sure hope it won't do this anymore. https://github.com/os-autoinst/openQA/pull/1569
Updated by coolo almost 7 years ago
Meanwhile I think I know what happened: the loop deletes all assets that have size 0, but ensure_size is only called at the beginning of the function. So if a new asset is registered between ensuring sizes and querying the assets to be deleted, it will gladly be removed, since undef == 0 in Perl.
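The Perl pitfall in isolation (a standalone toy example, not the actual GRU loop):

# A size that was never filled in is undef; in numeric comparison undef
# behaves like 0 (it only warns about an uninitialized value under
# 'use warnings'), so a 'size == 0' check cannot tell a brand-new asset
# from an empty one:
my $size;                                # new asset, ensure_size never ran for it
print "would remove\n" if $size == 0;    # prints - undef numifies to 0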
My new code does not suffer from that: at the beginning I create a hash from the DB and then only work on this hash instead of doing more queries. The job groups are still queried, but any assets that appeared in the meantime are ignored.
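Roughly the pattern, as a hedged sketch (the resultset and helper names are assumptions, not the code from the PR):

# One snapshot of the assets table up front; everything afterwards works on
# this hash, so assets registered later simply do not exist for this run.
my %assets = map { $_->id => { name => $_->name, size => $_->size // 0 } }
             $schema->resultset('Assets')->all;

for my $group (@job_groups) {                   # groups are still queried live
    for my $asset_id (asset_ids_of($group)) {
        next unless exists $assets{$asset_id};  # new/unknown assets are ignored
        account_or_remove($assets{$asset_id});
    }
}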
Updated by okurz almost 7 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: hpc_hwloc
https://openqa.suse.de/tests/1469528
Updated by szarate almost 7 years ago
- Target version changed from Current Sprint to Done
Updated by mkittler almost 6 years ago
- Related to action #45041: Asset cache cleanup still fails if new jobs are created at the same time added
Updated by okurz over 4 years ago
- Related to action #64544: Asset required by scheduled job wiped by limit_assets added