action #19672

GRU may delete assets while jobs are registered

Added by dimstar almost 7 years ago. Updated about 6 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2017-06-08
Due date:
% Done:

0%

Estimated time:

Description

Sample job: https://openqa.opensuse.org/tests/416444

This job started right at the moment a new snapshot was sent to openQA. Together with many others, it was 'incomplete' right away and apparently obsoleted by another job.

The Factory dashboard currently shows: passed: 78, softfailed: 19, failed: 14, obsoleted: 53, incomplete: 7

The totest manager is equally confused by this, and it generally takes a couple of retriggers of jobs that don't even appear in the openQA overview of a snapshot, making the entire process quite cumbersome.

This has been observed for the last couple of snapshots; the first occurrence was roughly a week ago.


Related issues 3 (0 open, 3 closed)

Related to openQA Project - action #28328: job was triggered trying to download HDD image but it's already gone (Rejected, okurz, 2017-11-24)
Related to openQA Project - action #45041: Asset cache cleanup still fails if new jobs are created at the same time (Resolved, mkittler, 2018-12-12)
Related to openQA Project - action #64544: Asset required by scheduled job wiped by limit_assets (Rejected, okurz, 2020-03-17)
Actions #1

Updated by coolo almost 7 years ago

  • Assignee set to coolo

I will look into it - the deployment is pretty old, possibly this is already fixed

Actions #2

Updated by mlin7442 almost 7 years ago

for example,

https://openqa.opensuse.org/api/v1/jobs/418447 <= clone_id:null, result:incomplete
https://openqa.opensuse.org/api/v1/jobs/418588

Since 418447 was not an "obsoleted" job, ttm thinks it is valid.

Actions #3

Updated by coolo almost 7 years ago

It was clone_id:null until dimstar retriggered it :(

Actions #4

Updated by coolo almost 7 years ago

this looks like a problem with the way the ISO is set up - I'm trying with a SQL dump of o3

Actions #5

Updated by coolo almost 7 years ago

09:46:43 POST /api/v1/isos created 418447
09:52:06 POST /api/v1/isos created 418588

This maps to

[10/Jun/2017:09:46:42 +0000] "POST /api/v1/isos?
ISO=openSUSE-Tumbleweed-DVD-x86_64-Snapshot20170609-Media.iso
&REPO_0_DEBUGINFO=openSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609-debuginfo
&BUILD=20170609
&VERSION=Tumbleweed
&SUSEMIRROR=http%3A%2F%2Fopenqa.opensuse.org%2Fassets%2Frepo%2FopenSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609
&FLAVOR=DVD
&FULLURL=1
&MIRROR_HTTP=http%3A%2F%2Fopenqa.opensuse.org%2Fassets%2Frepo%2FopenSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609
&ARCH=x86_64
&DISTRI=opensuse
&REPO_0=openSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609 HTTP/1.1" 200 1018 "-" "Mojolicious (Perl)"

and

"POST /api/v1/isos?
SUSEMIRROR=http%3A%2F%2Fopenqa.opensuse.org%2Fassets%2Frepo%2FopenSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609
&VERSION=Tumbleweed
&REPO_0_DEBUGINFO=openSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609-debuginfo
&ARCH=x86_64
&REPO_0=openSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609
&FLAVOR=DVD
&BUILD=20170609
&DISTRI=opensuse
&MIRROR_HTTP=http%3A%2F%2Fopenqa.opensuse.org%2Fassets%2Frepo%2FopenSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170609
&FULLURL=1
&ISO=openSUSE-Tumbleweed-DVD-x86_64-Snapshot20170609-Media.iso HTTP/1.1" 200 1018 "-" "Mojolicious (Perl)"

Looks like dups - unless I missed something.

The rsync.log is a bit sparse, but for completeness:

rsync start at Sat Jun 10 09:35:01 UTC 2017
settings for local/_product:openSUSE-Kubic-dvd5-DVD-x86_64/openSUSE-Kubic-DVD-x86_64-Snapshot20170609-Media.iso is neither hash nor sub!
openSUSE-Tumbleweed-DVD-s390x-Snapshot20170601-Media.iso: repo openSUSE-Tumbleweed-oss-s390x-Snapshot20170601 missing, skipped
++ WARN: from_733: broken byte order
++ WARN: from_733: broken byte order
kernel-default-4.11.3-1.5.nosrc.rpm -> kernel-source-4.11.3-1.5.src.rpm
kernel-default-4.11.3-1.5.nosrc.rpm -> kernel-source-4.11.3-1.5.src.rpm
kernel-syms-4.11.3-1.5.src.rpm -> kernel-source-4.11.3-1.5.src.rpm
++ WARN: from_733: broken byte order
++ WARN: from_733: broken byte order
kernel-docs-4.11.3-1.5.src.rpm -> kernel-source-4.11.3-1.5.src.rpm
rsync done at Sat Jun 10 09:46:52 UTC 2017
rsync start at Sat Jun 10 09:50:01 UTC 2017
settings for local/_product:openSUSE-Kubic-dvd5-DVD-x86_64/openSUSE-Kubic-DVD-x86_64-Snapshot20170609-Media.iso is neither hash nor sub!
openSUSE-Tumbleweed-DVD-s390x-Snapshot20170601-Media.iso: repo openSUSE-Tumbleweed-oss-s390x-Snapshot20170601 missing, skipped
++ WARN: from_733: broken byte order
++ WARN: from_733: broken byte order
kernel-default-4.11.3-1.5.nosrc.rpm -> kernel-source-4.11.3-1.5.src.rpm
kernel-default-4.11.3-1.5.nosrc.rpm -> kernel-source-4.11.3-1.5.src.rpm
kernel-syms-4.11.3-1.5.src.rpm -> kernel-source-4.11.3-1.5.src.rpm
++ WARN: from_733: broken byte order
++ WARN: from_733: broken byte order
kernel-docs-4.11.3-1.5.src.rpm -> kernel-source-4.11.3-1.5.src.rpm
rsync done at Sat Jun 10 09:52:16 UTC 2017

To me this looks like an rsync.pl problem: it schedules the DVD ISO before it's synced, and then again once it's there.

Actions #6

Updated by coolo almost 7 years ago

  • Assignee deleted (coolo)

rsync.pl is running with --verbose now - but today's snapshot was scheduled only once

Actions #7

Updated by coolo almost 7 years ago

GRU gave the ISO 30 seconds :)

[Mon Jun 19 03:26:30 2017] [3543:debug] [DBIx debug] Took 0.00251794 seconds executed: INSERT INTO assets ( name, t_created, t_updated, type) VALUES ( ?, ?, ?, ? ) RETURNING id: 'openSUSE-Tumbleweed-DVD-x86_64-Snapshot20170618-Media.iso', '2017-06-19 03:26:30', '2017-06-19 03:26:30', 'iso'.
[Mon Jun 19 03:26:56 2017] [19427:debug] Group Development Tumbleweed: iso/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20170618-Media.iso 0KiB->-36GiB
[Mon Jun 19 03:26:57 2017] [19427:info] GRU: removing /var/lib/openqa/share/factory/iso/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20170618-Media.iso
Actions #8

Updated by coolo almost 7 years ago

  • Subject changed from openqa schedules jobs before assets are ready to GRU may delete assets while jobs are registered
  • Priority changed from High to Normal

The problem is a bit special, but basically:

We have two job groups scheduled against the ISO. When scheduling, the smaller job group goes first: its jobs are created, but that group is already below quota at that point. GRU then runs, sees the ISO as something it can remove to help the quota, and removes it - basically in parallel with the second job group still being scheduled. The workaround applied is making the other job group large enough to actually hold the assets.

This is clearly a bug: if the job group was already below quota, the newly added asset can hardly be what pushed it over. But I have no good idea how to properly fix the underlying problem, other than giving assets a minimum time to live.
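The ordering can be replayed deterministically in a small sketch (all names hypothetical; this is a toy model of the race, not the openQA code):

```python
# Toy replay of the ordering described above: GRU's cleanup runs between
# the scheduling of the two job groups and removes the ISO the second
# group still needs. All names are hypothetical.
events = []
iso_store = set()

def schedule_group(name, iso):
    if iso not in iso_store:
        events.append(f"{name}: {iso} missing -> jobs incomplete")
    else:
        events.append(f"{name}: scheduled against {iso}")

def gru_cleanup(group_under_quota, iso):
    # The buggy premise from the report: the group is under quota, yet
    # GRU still treats the freshly registered ISO as removable.
    if group_under_quota and iso in iso_store:
        iso_store.discard(iso)
        events.append(f"GRU: removing {iso}")

iso = "Snapshot20170618-Media.iso"
iso_store.add(iso)
schedule_group("small group", iso)   # first group schedules fine
gru_cleanup(True, iso)               # GRU runs in parallel
schedule_group("second group", iso)  # the asset is already gone
```

Running this, the last event shows the second group hitting a missing asset, matching the incompletes seen on the dashboard.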

Actions #9

Updated by coolo over 6 years ago

  • Category set to Regressions/Crashes
  • Target version set to Ready

GRU should give assets some grace period - entrance-level issue
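A minimal sketch of such a grace period (hypothetical helper and cutoff, not the actual openQA implementation): assets younger than the threshold are never cleanup candidates, whatever the quota accounting says.

```python
from datetime import datetime, timedelta

# Hypothetical grace-period check: an asset registered only moments ago
# (like the 30-second ISO in comment #7) is never eligible for removal.
GRACE = timedelta(hours=2)  # assumed threshold for illustration

def removable(asset_created_at, now=None):
    """True only if the asset is older than the grace period."""
    now = now or datetime.utcnow()
    return now - asset_created_at > GRACE

now = datetime(2017, 6, 19, 3, 27, 0)
fresh = datetime(2017, 6, 19, 3, 26, 30)  # registered 30 seconds ago
old = datetime(2017, 6, 10, 0, 0, 0)
print(removable(fresh, now))  # the fresh ISO survives
print(removable(old, now))    # week-old assets stay eligible
```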

Actions #10

Updated by okurz over 6 years ago

  • Related to action #28328: job was triggered trying to download HDD image but it's already gone added
Actions #11

Updated by AdamWill over 6 years ago

Should we consider https://github.com/os-autoinst/openQA/pull/1518 as fixing this?

Actions #12

Updated by coolo over 6 years ago

  • Status changed from New to Resolved

ok, let's hope for the best

Actions #13

Updated by coolo about 6 years ago

  • Status changed from Resolved to New

An asset that lived 23 seconds:

06:49 < coolo> [2018-01-25T21:11:22.0972 CET] [info] Registering asset hdd/sle-15-x86_64-423.2-textmode@64bit.qcow2.TEMP-bwBEAWYH
06:49 < coolo> [2018-01-25T21:11:45.0362 CET] [info] GRU: removing /var/lib/openqa/share/factory/hdd/sle-15-x86_64-423.2-textmode@64bit.qcow2

Actions #15

Updated by szarate about 6 years ago

  • Assignee set to coolo
  • Target version changed from Ready to Current Sprint
Actions #16

Updated by coolo about 6 years ago

  • Status changed from New to Resolved

I reworked the GRU algorithm and I sure hope it won't do this anymore. https://github.com/os-autoinst/openQA/pull/1569

Actions #17

Updated by coolo about 6 years ago

Meanwhile I think I know what happened. The loop deletes all assets that have size 0, but ensure_size is only called at the beginning of the function. So if a new asset is registered between ensuring sizes and querying the assets to be deleted, it will gladly be removed, as undef == 0 in Perl.

My new code does not suffer from that: at the beginning I create a hash from the DB and then only work on this hash instead of doing more queries. The job groups are still queried, but any assets that appeared in between are ignored.
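The undef == 0 trap and the snapshot approach can be sketched in Python (a stand-in for the Perl code; names hypothetical):

```python
# Buggy variant: sizes are ensured once up front, then the delete query
# runs later against the live table. An asset registered in between has
# no measured size yet, and a missing size compares equal to 0
# (undef == 0 in Perl, `or 0` here), so it gets deleted too.
def cleanup_buggy(live_assets):
    # live_assets: name -> size; None means registered after ensure_size
    return [name for name, size in live_assets.items() if (size or 0) == 0]

# Fixed variant, mirroring the rework described above: take a snapshot
# of the known assets at the start and only ever operate on it; anything
# registered afterwards is simply not in the hash and is ignored.
def cleanup_fixed(snapshot, live_assets):
    return [name for name in snapshot
            if name in live_assets and (snapshot[name] or 0) == 0]

snapshot = {"old-empty.iso": 0}
live = dict(snapshot)
live["fresh.qcow2"] = None  # registered after the snapshot was taken
print(cleanup_buggy(live))           # wrongly includes the fresh asset
print(cleanup_fixed(snapshot, live)) # only the genuinely empty old one
```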

Actions #18

Updated by okurz about 6 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: hpc_hwloc
https://openqa.suse.de/tests/1469528

Actions #19

Updated by szarate about 6 years ago

  • Target version changed from Current Sprint to Done
Actions #20

Updated by mkittler over 5 years ago

  • Related to action #45041: Asset cache cleanup still fails if new jobs are created at the same time added
Actions #21

Updated by okurz about 4 years ago

  • Related to action #64544: Asset required by scheduled job wiped by limit_assets added