action #19672
closed
GRU may delete assets while jobs are registered
Description
Sample job: https://openqa.opensuse.org/tests/416444
This job started right at the moment a new snapshot was sent to openQA. Together with many others, it was 'incomplete' right away and was apparently obsoleted by another job.
The factory dashboard currently shows: passed: 78, softfailed: 19, failed: 14, obsoleted: 53, incomplete: 7
The totest manager is equally confused by this, and it generally takes a couple of retriggers of jobs that don't even appear in the openQA overview of a snapshot, making the entire process quite cumbersome.
This has been observed for the last couple of snapshots; the first occurrence was roughly a week ago.
Updated by coolo over 7 years ago
- Assignee set to coolo
I will look into it - the deployment is pretty old, possibly this is already fixed
Updated by mlin7442 over 7 years ago
for example,
https://openqa.opensuse.org/api/v1/jobs/418447 <= clone_id:null, result:incomplete
https://openqa.opensuse.org/api/v1/jobs/418588
As 418447 was not an "obsoleted" job, ttm considers it valid.
Updated by coolo over 7 years ago
It was clone_id:null until dimstar retriggered it :(
Updated by coolo over 7 years ago
This looks like a problem with the way the ISO is set up - I'm trying it with a SQL dump of o3.
Updated by coolo over 7 years ago
09:46:43 POST /api/v1/isos created 418447
09:52:06 POST /api/v1/isos created 418588
This maps to
[10/Jun/2017:09:46:42 +0000] "POST /api/v1/isos?
ISO=openSUSE-Tumbleweed-DVD-x86_64-Snapshot20170609-Media.iso
&REPO_0_DEBUGINFO=openSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609-debuginfo
&BUILD=20170609
&VERSION=Tumbleweed
&SUSEMIRROR=http%3A%2F%2Fopenqa.opensuse.org%2Fassets%2Frepo%2FopenSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609
&FLAVOR=DVD
&FULLURL=1
&MIRROR_HTTP=http%3A%2F%2Fopenqa.opensuse.org%2Fassets%2Frepo%2FopenSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609
&ARCH=x86_64
&DISTRI=opensuse
&REPO_0=openSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609 HTTP/1.1" 200 1018 "-" "Mojolicious (Perl)"
and
"POST /api/v1/isos?
SUSEMIRROR=http%3A%2F%2Fopenqa.opensuse.org%2Fassets%2Frepo%2FopenSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609
&VERSION=Tumbleweed
&REPO_0_DEBUGINFO=openSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609-debuginfo
&ARCH=x86_64
&REPO_0=openSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609
&FLAVOR=DVD
&BUILD=20170609
&DISTRI=opensuse
&MIRROR_HTTP=http%3A%2F%2Fopenqa.opensuse.org%2Fassets%2Frepo%2FopenSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609
&FULLURL=1
&ISO=openSUSE-Tumbleweed-DVD-x86_64-Snapshot20170609-Media.iso HTTP/1.1" 200 1018 "-" "Mojolicious (Perl)"
Looks like dups - unless I missed something.
The rsync.log is a bit sparse, but for completeness:
rsync start at Sat Jun 10 09:35:01 UTC 2017
settings for local/_product:openSUSE-Kubic-dvd5-DVD-x86_64/openSUSE-Kubic-DVD-x86_64-Snapshot20170609-Media.iso is neither hash nor sub!
openSUSE-Tumbleweed-DVD-s390x-Snapshot20170601-Media.iso: repo openSUSE-Tumbleweed-oss-s390x-Snapshot20170601 missing, skipped
++ WARN: from_733: broken byte order
++ WARN: from_733: broken byte order
kernel-default-4.11.3-1.5.nosrc.rpm -> kernel-source-4.11.3-1.5.src.rpm
kernel-default-4.11.3-1.5.nosrc.rpm -> kernel-source-4.11.3-1.5.src.rpm
kernel-syms-4.11.3-1.5.src.rpm -> kernel-source-4.11.3-1.5.src.rpm
++ WARN: from_733: broken byte order
++ WARN: from_733: broken byte order
kernel-docs-4.11.3-1.5.src.rpm -> kernel-source-4.11.3-1.5.src.rpm
rsync done at Sat Jun 10 09:46:52 UTC 2017
rsync start at Sat Jun 10 09:50:01 UTC 2017
settings for local/_product:openSUSE-Kubic-dvd5-DVD-x86_64/openSUSE-Kubic-DVD-x86_64-Snapshot20170609-Media.iso is neither hash nor sub!
openSUSE-Tumbleweed-DVD-s390x-Snapshot20170601-Media.iso: repo openSUSE-Tumbleweed-oss-s390x-Snapshot20170601 missing, skipped
++ WARN: from_733: broken byte order
++ WARN: from_733: broken byte order
kernel-default-4.11.3-1.5.nosrc.rpm -> kernel-source-4.11.3-1.5.src.rpm
kernel-default-4.11.3-1.5.nosrc.rpm -> kernel-source-4.11.3-1.5.src.rpm
kernel-syms-4.11.3-1.5.src.rpm -> kernel-source-4.11.3-1.5.src.rpm
++ WARN: from_733: broken byte order
++ WARN: from_733: broken byte order
kernel-docs-4.11.3-1.5.src.rpm -> kernel-source-4.11.3-1.5.src.rpm
rsync done at Sat Jun 10 09:52:16 UTC 2017
To me this looks like an rsync.pl problem - it schedules the DVD ISO before it's synced, and then again once it's there.
Updated by coolo over 7 years ago
- Assignee deleted (coolo)
rsync.pl is running with --verbose now - but today's snapshot was scheduled only once
Updated by coolo over 7 years ago
GRU gave the ISO 30 seconds :)
[Mon Jun 19 03:26:30 2017] [3543:debug] [DBIx debug] Took 0.00251794 seconds executed: INSERT INTO assets ( name, t_created, t_updated, type) VALUES ( ?, ?, ?, ? ) RETURNING id: 'openSUSE-Tumbleweed-DVD-x86_64-Snapshot20170618-Media.iso', '2017-06-19 03:26:30', '2017-06-19 03:26:30', 'iso'.
[Mon Jun 19 03:26:56 2017] [19427:debug] Group Development Tumbleweed: iso/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20170618-Media.iso 0KiB->-36GiB
[Mon Jun 19 03:26:57 2017] [19427:info] GRU: removing /var/lib/openqa/share/factory/iso/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20170618-Media.iso
Updated by coolo over 7 years ago
- Subject changed from openqa schedules jobs before assets are ready to GRU may delete assets while jobs are registered
- Priority changed from High to Normal
The problem is a bit special, but basically:
we have two job groups scheduled against the ISO. When scheduling, the smaller job group is scheduled first - its jobs are created, but that job group was already below quota at the beginning. GRU then runs, finds that removing the ISO helps the quota, and removes it. This happens basically in parallel with the second job group being scheduled. The workaround applied is making the other job group large enough to actually hold the assets.
This is clearly a bug - if the job group was already below quota, the newly added asset can hardly fix it. But I have no good idea how to properly fix the underlying problem, apart from giving assets a minimum time to live.
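For illustration, a rough sketch of that timeline (hypothetical Perl pseudocode; the function names are made up and this is not the actual openQA scheduler or GRU code):

# t0: the smaller job group is scheduled first against the new ISO
register_asset_if_missing('snapshot.iso');    # asset row exists, size not accounted yet
create_jobs(group => 'small', iso => 'snapshot.iso');

# t1: GRU cleanup runs; the small group was already below quota beforehand,
#     so the fresh ISO ends up as a removal candidate
gru_limit_assets();   # unlinks /var/lib/openqa/share/factory/iso/snapshot.iso

# t2: the second, larger job group is scheduled against the same ISO,
#     but its jobs now reference an asset that is already gone
create_jobs(group => 'large', iso => 'snapshot.iso');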
Updated by coolo about 7 years ago
- Category set to Regressions/Crashes
- Target version set to Ready
GRU should give assets some grace period - entry-level issue
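A minimal sketch of what such a grace period could look like in the cleanup loop (assuming a DateTime-inflated t_created column on the asset rows and made-up helper names, so purely illustrative, not a patch):

# Sketch only: skip assets registered less than two hours ago
use DateTime;

my $grace_cutoff = DateTime->now->subtract(hours => 2);
for my $asset (@removal_candidates) {
    # freshly registered assets get a minimum time to live
    next if $asset->t_created > $grace_cutoff;
    remove_asset($asset);
}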
Updated by okurz about 7 years ago
- Related to action #28328: job was triggered trying to download HDD image but it's already gone added
Updated by AdamWill almost 7 years ago
Should we consider https://github.com/os-autoinst/openQA/pull/1518 as fixing this?
Updated by coolo almost 7 years ago
- Status changed from Resolved to New
An asset that lived for 23 seconds:
06:49 < coolo> [2018-01-25T21:11:22.0972 CET] [info] Registering asset hdd/sle-15-x86_64-423.2-textmode@64bit.qcow2.TEMP-bwBEAWYH
06:49 < coolo> [2018-01-25T21:11:45.0362 CET] [info] GRU: removing /var/lib/openqa/share/factory/hdd/sle-15-x86_64-423.2-textmode@64bit.qcow2
Updated by okurz almost 7 years ago
examples from SLE15 aarch64 build 423.2:
Updated by szarate almost 7 years ago
- Assignee set to coolo
- Target version changed from Ready to Current Sprint
Updated by coolo almost 7 years ago
- Status changed from New to Resolved
I reworked the GRU algorithm and I sure hope it won't do this anymore. https://github.com/os-autoinst/openQA/pull/1569
Updated by coolo almost 7 years ago
Meanwhile I think I know what happened: the loop deletes all assets that have size 0, but ensure_size is only called at the beginning of the function. So if a new asset is registered between ensuring sizes and querying the assets to be deleted, it will gladly be removed, since undef == 0 in Perl.
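The Perl pitfall in isolation (a standalone toy example, not the actual GRU loop):

# A size that was never filled in is undef; in numeric comparison undef
# behaves like 0 (it only warns about an uninitialized value under
# 'use warnings'), so a 'size == 0' check cannot tell a brand-new asset
# from an empty one:
my $size;                                # new asset, ensure_size never ran for it
print "would remove\n" if $size == 0;    # prints - undef numifies to 0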
My new code does not suffer from that: at the beginning I create a hash from the DB and then only work on this hash instead of doing more queries. The job groups are still queried, but any assets that appeared in the meantime are ignored.
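Roughly the pattern, as a hedged sketch (the resultset and helper names are assumptions, not the code from the PR):

# One snapshot of the assets table up front; everything afterwards works on
# this hash, so assets registered later simply do not exist for this run.
my %assets = map { $_->id => { name => $_->name, size => $_->size // 0 } }
             $schema->resultset('Assets')->all;

for my $group (@job_groups) {                   # groups are still queried live
    for my $asset_id (asset_ids_of($group)) {
        next unless exists $assets{$asset_id};  # new/unknown assets are ignored
        account_or_remove($assets{$asset_id});
    }
}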
Updated by okurz almost 7 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: hpc_hwloc
https://openqa.suse.de/tests/1469528
Updated by szarate almost 7 years ago
- Target version changed from Current Sprint to Done
Updated by mkittler almost 6 years ago
- Related to action #45041: Asset cache cleanup still fails if new jobs are created at the same time added
Updated by okurz over 4 years ago
- Related to action #64544: Asset required by scheduled job wiped by limit_assets added