action #48554

closed

Gru/minion task deleted from GruTasks db on retry

Added by AdamWill about 5 years ago. Updated about 5 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version: -
Start date: 2019-02-28
Due date:
% Done: 0%
Estimated time:

Description

So, after the recent change to the gru/minion stuff to run tasks in parallel, I sent this commit:

https://github.com/os-autoinst/openQA/commit/444e686fb78c75b96395b199f70d8efdd75288f2

to try and ensure we do not have multiple tasks trying to download the same asset at the same time.

However, after running that in staging for a bit, I've found there's a problem with it.

There's a pattern for Fedora compose tests where we post the same ISO as two different flavors; this is because we have a 'universal' flavor which can technically be run on several different ISOs, and we run it on the best candidate that's present in the compose. Usually the best candidate is the Server DVD ISO, so on a typical compose, we will post the same ISO as 'server-dvd-iso' and 'universal' flavors.

This creates two download_asset tasks for the same ISO, which is exactly the case this code guards against. Both tasks are correctly created and sent for execution as minion jobs, and both are also registered in the GruTasks database table. At this point, all the jobs for both flavors should be blocked from executing until the relevant minion job finishes.

The first download_asset task executes normally; I think what happens is we wind up in the subclassed execute in lib/OpenQA/WebAPI/GruJob.pm, which ultimately calls the parent execute and waits for it to return. That, I think, winds up returning when _download in lib/OpenQA/Task/Asset/Download.pm returns. For this first task, that happens when the download is complete, so that's fine. Then the subclassed execute calls _delete_gru, which deletes the gru task entry from GruTasks.

For the second task, however, we get a problem. The same basic flow happens: we wind up in the subclassed execute, which calls the parent execute, which runs the subroutine registered for the task, which runs _download and returns when it returns. However, it returns immediately, because it just hits the return $job->retry({delay => 30}) block from my commit. At this point, our subclassed execute calls _delete_gru and deletes the task from GruTasks (and thus the dependency from GruDependencies), even though the minion job itself has not executed yet and still exists in the 'inactive' state.
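To make the sequence easier to follow, here is a rough toy model of it in Python. It is not openQA code: the class names only mirror the roles described above, and `make_download_task`, `held_guards`, and the asset name `Server-dvd.iso` are all made up for illustration. It assumes that a retried job simply goes back to the 'inactive' state.

```python
# Toy model of the flow described above. None of this is openQA or Minion code;
# all names are hypothetical stand-ins for the roles in the description.

class MinionJob:
    def __init__(self, job_id, task):
        self.id = job_id
        self.task = task
        self.state = "inactive"

    def retry(self, delay):
        # Assumption: a retried job is re-queued and goes back to 'inactive'.
        self.state = "inactive"

    def execute(self):
        self.state = "active"
        self.task(self)
        # Only mark the job finished if the task did not re-queue it.
        if self.state == "active":
            self.state = "finished"


class GruJob(MinionJob):
    def __init__(self, job_id, task, gru_tasks):
        super().__init__(job_id, task)
        self.gru_tasks = gru_tasks

    def execute(self):
        super().execute()
        # The problematic step: this runs whenever the parent returns,
        # including when the task only asked for a retry.
        self.gru_tasks.discard(self.id)


def make_download_task(asset, held_guards):
    def download(job):
        if asset in held_guards:
            # Another job is already downloading this asset: retry later.
            return job.retry(delay=30)
        # The real download would happen here, while holding the guard.
    return download


gru_tasks = {1, 2}                 # stand-in for rows in the GruTasks table
held_guards = {"Server-dvd.iso"}   # job 1 is still downloading
second = GruJob(2, make_download_task("Server-dvd.iso", held_guards), gru_tasks)
second.execute()
print(second.state, gru_tasks)
```

In this model the second job ends up 'inactive' (still waiting to run), yet its entry is already gone from `gru_tasks`, which is the state that lets the dependent jobs start too early.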

So now all the jobs for whichever flavor got its download_asset task scheduled second immediately execute and of course they all die, because the ISO hasn't downloaded yet.

Basically I think the problem here is the assumption that the parent execute returning means the minion job actually finished. That assumption is no longer true, now that we have this "use a guard and retry if it's already taken" behaviour in download_asset.
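One direction that follows from this, sketched very loosely in Python rather than as a patch: look at the job's state after the parent execute returns, and only delete the gru task once the job has reached a terminal state. The helper `delete_gru_if_done` and the `gru_tasks` set are hypothetical, and treating 'failed' as terminal for cleanup purposes is an assumption, not something settled here.

```python
# Hypothetical sketch, not openQA code: clean up the gru task only when the
# minion job can no longer run again.

# Assumption: 'finished' and 'failed' both count as "done" for cleanup.
TERMINAL_STATES = {"finished", "failed"}


def delete_gru_if_done(state, gru_tasks, gru_id):
    """Remove gru_id from gru_tasks only if the job state is terminal.

    Returns True if the entry was removed, False if it was left in place.
    """
    if state in TERMINAL_STATES:
        gru_tasks.discard(gru_id)
        return True
    return False


gru_tasks = {1, 2}
first_removed = delete_gru_if_done("finished", gru_tasks, 1)   # download done
second_removed = delete_gru_if_done("inactive", gru_tasks, 2)  # retried, still queued
print(first_removed, second_removed, gru_tasks)
```

With a check like this, the retried job's entry stays in place, so the jobs depending on it would remain blocked until the retry actually runs.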

I suspect the same bug actually exists for other tasks where we use the same pattern, e.g. cache_asset and cache_tests - those tasks will also get deleted from the GruTasks database prematurely (before the minion job actually executes). The bug is likely much less visible in that case, though.

I'm really not sure what the best fix for this would be. Does anyone have any ideas? :) Thanks!
