action #64544

closed

Asset required by scheduled job wiped by limit_assets

Added by AdamWill over 4 years ago. Updated over 4 years ago.

Status: Rejected
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version: -
Start date: 2020-03-17
Due date:
% Done: 0%
Estimated time:

Description

Lately in Fedora openQA I've noticed several occasions where an asset required by a still-scheduled job was removed by limit_assets. This is not supposed to happen.

Here is one example:

https://openqa.stg.fedoraproject.org/tests/764286
https://openqa.stg.fedoraproject.org/tests/764287

#764286 creates and uploads a live image. #764287 should have run a boot-and-install test with that live image, but it (and its army of clones) failed immediately because the live image had been garbage collected. The start time of that job is logged as 23:13:11. This limit_assets task:

https://openqa.stg.fedoraproject.org/minion/jobs?id=40629

is what removed it. It's logged as starting at "2020-03-16T23:00:15.7527Z", at which time 764287 must still have been scheduled. It should not have removed an asset for a pending job.


Related issues 3 (2 open, 1 closed)

Related to openQA Project (public) - action #12180: [webui] Prevent tests to be triggered when required assets are not present (anymore) (New, 2016-05-31)
Related to openQA Project (public) - action #19672: GRU may delete assets while jobs are registered (Resolved, coolo, 2017-06-08)
Related to openQA Project (public) - action #121573: Asset/HDD goes missing while job is running (New, 2022-12-06)

Actions #1

Updated by AdamWill over 4 years ago

Note, we're currently running git commit 4861e34, from 2020-02-05. I checked lib/OpenQA/Task/Asset/Limit.pm and lib/OpenQA/Schema/ResultSet/Assets.pm and neither has really changed since then (just a license notice change in the latter).

Actions #2

Updated by mkittler over 4 years ago

I cannot open the Minion dashboard without login. However, it looks like the asset in question is HDD_1=disk_f32_minimal_3_x86_64.img. I can only find this setting in the parent job. The child job only has ISO=Fedora-Workstation-Live-x86_64-FEDORA-2020-9d65d662e2.iso. So unless the parent job is supposed to create the ISO file, it seems that the HDD_1 setting is simply missing in the child job.

(We usually have, e.g., PUBLISH_HDD_1=opensuse-Tumbleweed-x86_64-20200317-kde@64bit.qcow2 in the parent and HDD_1=opensuse-Tumbleweed-x86_64-20200317-kde@64bit.qcow2 in the child job.)

Actions #3

Updated by AdamWill over 4 years ago

No, that's not the asset in question. That's a fixed asset. The asset that got wiped was Fedora-Workstation-Live-x86_64-FEDORA-2020-9d65d662e2.iso. That is created and uploaded by 764286 (live_build).

Actions #4

Updated by okurz over 4 years ago

  • Related to action #12180: [webui] Prevent tests to be triggered when required assets are not present (anymore) added
Actions #5

Updated by okurz over 4 years ago

  • Category set to Support
  • Status changed from New to Feedback
  • Assignee set to okurz

If the quota is exceeded, the asset is deleted. Could this be the case? If yes, I don't think it's a good default to allow an over-quota condition until a job is finished, but I see the following possibilities:

1. Keep as is: if the quota is exceeded, assets are deleted at any time.
2. Based on configuration options, allow keeping assets until jobs are completed, or even until the jobs referencing the assets are deleted (this can use much more space than the quota).
3. Go even further and define a "soft-quota" and a "hard-quota" like in GNU/Linux aquota: assets for unfinished jobs are kept even when over the soft-quota, until the hard-quota is reached, at which point assets for unfinished jobs are deleted as well.

I would really suggest keeping option 1.
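To make the three options easier to compare, here is a minimal Python sketch of the deletion decision under each of them. It is not openQA code: the names Policy, Asset and may_delete are hypothetical, the numbers are made up, and the handling of option 3 is only one possible reading of the soft/hard quota idea described above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Policy(Enum):
    DELETE_ANY_TIME = 1   # option 1: over quota means deletable, always
    KEEP_PENDING = 2      # option 2: never delete assets of unfinished jobs
    SOFT_HARD_QUOTA = 3   # option 3: protect pending assets up to a hard quota


@dataclass
class Asset:
    name: str
    pending: bool  # True if an unfinished job still references the asset


def may_delete(asset: Asset, used_gb: int, policy: Policy,
               quota_gb: int, hard_quota_gb: Optional[int] = None) -> bool:
    """Return True if the cleanup would be allowed to delete this asset.

    used_gb is the group's current asset usage; quota_gb doubles as the
    soft quota for option 3 (an assumption made for this sketch).
    """
    if used_gb <= quota_gb:
        return False  # within quota: nothing needs to go
    if policy is Policy.DELETE_ANY_TIME:
        return True
    if policy is Policy.KEEP_PENDING:
        return not asset.pending
    if hard_quota_gb is None:
        raise ValueError("SOFT_HARD_QUOTA needs a hard quota")
    # Over the soft quota: pending assets survive until the hard quota is hit.
    return (not asset.pending) or used_gb > hard_quota_gb


iso = Asset("live.iso", pending=True)
decisions = {p.name: may_delete(iso, 120, p, 100, 150) for p in Policy}
print(decisions)
```

With a pending asset and usage between the two quotas, only option 1 allows the deletion; once usage exceeds the hard quota, option 3 allows it as well.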

Actions #6

Updated by AdamWill over 4 years ago

I'd agree those are the choices, but if 1) is the current case, then things changed in some rewrite along the line, because 2) used to be intentionally and explicitly the case (or rather, there was not even any configuration choice, we just always kept assets associated with pending jobs). I know because I wrote it :) (you also commented on it, though after coolo merged it)

https://github.com/os-autoinst/openQA/pull/1518

Actions #7

Updated by okurz over 4 years ago

  • Related to action #19672: GRU may delete assets while jobs are registered added
Actions #8

Updated by okurz over 4 years ago

  • Category changed from Support to Regressions/Crashes
  • Status changed from Feedback to Workable
  • Assignee changed from okurz to mkittler
  • Target version set to Current Sprint

Interesting. Then I consider this a regression. Though I am not aware which change could have caused this.

@mkittler you might be able to know?

Actions #9

Updated by mkittler over 4 years ago

I know about the feature to retain "pending" assets. There is also a test case for that feature in t/14-grutasks.t. Of course the test case might be broken as well as the feature, but we haven't changed a lot in that area recently. I'll try to break the "pending" job handling locally to check whether the test in t/14-grutasks.t would catch that. Depending on the outcome I can investigate the problem further. Of course, without access to the openQA instance and with the general lack of reproducibility, I might get stuck. That also raises the questions: Is this happening frequently on your instance (@AdamWill)? Is this happening in our instances (@tools-team)? I haven't come across it so far on our instances.

Actions #10

Updated by AdamWill over 4 years ago

@mkittler, yup, that's why I wrote "if 1) is the current case" - I figured it might well not be an intentional change.

Unfortunately no this isn't happening very often or anything - I just happened to catch a few cases. I believe that, at the time this happened, I was re-running a lot of update tests simultaneously, and the asset size limit for our 'Fedora Updates' group is only 100GB, so it's possible we ran into some edge case in the logic here or something.

Later today I'll try and gin up a search to find any other cases that failed similarly.

Actions #11

Updated by mkittler over 4 years ago

  • Status changed from Workable to In Progress

The following query is used to compute the "pending" state of an asset:

    # define query for prefetching the assets - note the sort order here:
    # We sort the assets in descending order by highest related job ID,
    # so assets for recent jobs are considered first (and most likely to be kept).
    # Use of coalesce is required; otherwise assets without any job would end up
    # at the top.
    my $prioritized_assets_query;
    if ($options{compute_pending_state_and_max_job}) {
        $prioritized_assets_query = <<'END_SQL';
            select
                a.id as id, a.name as name, a.t_created as t_created, a.size as size, a.type as type,
                a.fixed as fixed,
                coalesce(max(j.id), -1) as max_job,
                max(case when j.id is not null and j.result='none' then 1 else 0 end) as pending
            from assets a
                left join jobs_assets ja on a.id=ja.asset_id
                left join jobs j on j.id=ja.job_id
            group by a.id
            order by max_job desc, a.t_created desc;

(see lib/OpenQA/Schema/ResultSet/Assets.pm)

I guess it was even me who changed this computation the last time. The problem I wanted to address back then was race conditions which led to failing cleanup tasks. However, the query is not perfect. It only looks at the most recently created job for a certain asset. That job can of course have already been concluded while a previous job is still running.

Maybe that can easily be improved. But before that it is best to look at the test we actually have (supposedly t/14-grutasks.t), which is not sufficient. When I change the case expression to then 1 else 1 end the test fails, but when I change it to then 0 else 0 end it still passes.
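The observation about the two mutated expressions can be reproduced in a toy setting. The following Python sketch runs a trimmed version of the query on an in-memory SQLite database and feeds it into a simplified cleanup loop. Everything except the SQL shape is an assumption: assets_to_remove is a hypothetical stand-in for the real limit task, the data is invented, and the two checks are not the assertions from t/14-grutasks.t. It only illustrates which kind of assertion catches which mutation.

```python
import sqlite3

# Trimmed version of the query, with the pending expression left as a slot
# so that mutated variants can be swapped in.
QUERY = """
    select a.id as id, a.size as size,
           coalesce(max(j.id), -1) as max_job,
           max({pending_expr}) as pending
    from assets a
        left join jobs_assets ja on a.id = ja.asset_id
        left join jobs j on j.id = ja.job_id
    group by a.id
    order by max_job desc, a.t_created desc
"""

EXPRESSIONS = {
    "original": "case when j.id is not null and j.result='none' then 1 else 0 end",
    "always_pending": "case when j.id is not null and j.result='none' then 1 else 1 end",
    "never_pending": "case when j.id is not null and j.result='none' then 0 else 0 end",
}


def make_db():
    """Three assets of size 10; only asset 2 belongs to an unfinished job."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table assets (id integer primary key, size integer, t_created text);
        create table jobs (id integer primary key, result text);
        create table jobs_assets (job_id integer, asset_id integer);
        insert into assets values
            (1, 10, '2020-03-16'), (2, 10, '2020-03-15'), (3, 10, '2020-03-14');
        insert into jobs values (30, 'passed'), (20, 'none'), (10, 'passed');
        insert into jobs_assets values (30, 1), (20, 2), (10, 3);
    """)
    return conn


def assets_to_remove(conn, pending_expr, limit):
    """Hypothetical cleanup: keep assets in query order while they fit into
    the limit; pending assets are always kept (and still count as used)."""
    used, removed = 0, []
    for asset_id, size, _max_job, pending in conn.execute(
            QUERY.format(pending_expr=pending_expr)):
        if pending or used + size <= limit:
            used += size
        else:
            removed.append(asset_id)
    return removed


conn = make_db()
results = {name: assets_to_remove(conn, expr, limit=10)
           for name, expr in EXPRESSIONS.items()}
# Weak check: only asserts that the unreferenced old asset is gone.
weak = {name: 3 in removed for name, removed in results.items()}
# Strong check: also asserts that the pending asset survives.
strong = {name: removed == [3] for name, removed in results.items()}
print(results, weak, strong)
```

In this toy setup the weak check fails only for the then 1 else 1 variant, matching the behaviour described above, while the strong check fails for both mutations because it also requires the over-quota pending asset to be kept.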

Actions #12

Updated by mkittler over 4 years ago

Actions #13

Updated by mkittler over 4 years ago

Apparently it is not like I suspected. The test I've created in fact fails in a different way than expected. Recognizing that an asset is pending does not rely only on the most recent job associated with that asset, and therefore the asset would actually always be preserved correctly. (So the condition case when j.id is not null and j.result='none' then 1 else 0 end in the query is actually evaluated for every associated job.)
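That the aggregate covers every associated job can be checked with a small experiment. The Python sketch below runs the query's select shape against an in-memory SQLite database (the real instance uses a different database, and the table contents here are invented): asset 1 has an older job that is still pending and a newer one that has finished.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table assets (id integer primary key, name text, t_created text);
    create table jobs (id integer primary key, result text);
    create table jobs_assets (job_id integer, asset_id integer);

    insert into assets values
        (1, 'live.iso', '2020-03-16'),
        (2, 'old.iso', '2020-03-01'),
        (3, 'orphan.iso', '2020-02-01');
    -- job 10 is older and still pending; job 11 is newer and already finished
    insert into jobs values (10, 'none'), (11, 'passed'), (12, 'failed');
    insert into jobs_assets values (10, 1), (11, 1), (12, 2);
""")

rows = conn.execute("""
    select a.id as id, a.name as name,
           coalesce(max(j.id), -1) as max_job,
           max(case when j.id is not null and j.result='none' then 1 else 0 end) as pending
    from assets a
        left join jobs_assets ja on a.id = ja.asset_id
        left join jobs j on j.id = ja.job_id
    group by a.id
    order by max_job desc, a.t_created desc
""").fetchall()

for row in rows:
    print(row)

# Map asset name to its computed pending flag for easy inspection.
pending_by_name = {name: pending for _id, name, _max_job, pending in rows}
```

Here live.iso comes out with max_job 11 and pending 1, although job 11 itself has a result: max() runs over the case expression of all joined jobs, so the older pending job 10 is enough. The asset without any job gets max_job -1 through coalesce and is sorted last.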

But in my test a different and seemingly unrelated asset is additionally deleted when the most recent associated job is not pending but only a previous one. The setup of the unit tests we have is really hard to understand and surprising things happen when you adjust a detail. But maybe I actually found another limitation within the accounting which might be the reason for the issue.

Actions #14

Updated by mkittler over 4 years ago

  • Priority changed from High to Normal

I found out why my tests were failing and it was just the test setup. It is not very easy to test this because the order in which assets are considered is to some degree random.

So I could only find out so far that preserving pending assets works and that all jobs associated with the asset are considered to determine whether an asset is pending. Any job which has no result is currently considered pending. This is slightly wrong (see https://github.com/os-autoinst/openQA/pull/2918) but should not cause the problem this ticket is about.

I'm afraid that without a database containing the jobs and assets at the point where the cleanup accidentally deletes assets, it is hard to find out what is causing it. At least the log of the limit task would be nice (I cannot access the link to your Minion dashboard).

Actions #15

Updated by mkittler over 4 years ago

  • Status changed from In Progress to New
  • Assignee deleted (mkittler)

I'm currently out of ideas.

Actions #16

Updated by okurz over 4 years ago

  • Due date set to 2020-05-26
  • Status changed from New to Feedback
  • Assignee set to AdamWill
  • Target version deleted (Current Sprint)

@AdamWill could you please help us by providing details on what you see missing here?

Actions #17

Updated by AdamWill over 4 years ago

I mean, I don't know what details you want. I reported a bug: I saw that a test had failed incomplete because an asset it needed had been garbage collected while it was scheduled. That's not supposed to happen, so I reported it. That's really all I've got.

Actions #18

Updated by okurz over 4 years ago

  • Due date deleted (2020-05-26)
  • Status changed from Feedback to Rejected
  • Assignee changed from AdamWill to okurz

I'm afraid without the details, e.g. log of the limit task from the minion dashboard, we can't solve it. mkittler has verified that assets linked to pending jobs shouldn't be deleted by default.

Actions #19

Updated by okurz about 2 years ago

  • Related to action #121573: Asset/HDD goes missing while job is running added