action #48554
closedGru/minion task deleted from GruTasks db on retry
0%
Description
So, after the recent change to the gru/minion stuff to run tasks in parallel, I sent this commit:
https://github.com/os-autoinst/openQA/commit/444e686fb78c75b96395b199f70d8efdd75288f2
to try and ensure we do not have multiple tasks trying to download the same asset at the same time.
However, after running that in staging for a bit, I've found there's a problem with it.
There's a pattern for Fedora compose tests where we post the same ISO as two different flavors; this is because we have a 'universal' flavor which can technically be run on several different ISOs, and we run it on the best candidate that's present in the compose. Usually the best candidate is the Server DVD ISO, so on a typical compose, we will post the same ISO as 'server-dvd-iso' and 'universal' flavors.
This creates two download_asset tasks for the same ISO, just the case this code guards against. Both tasks are correctly created and sent for execution as minion jobs, and both are also registered in the GruTasks database table. At this point, all the jobs for both flavors should be blocked for execution until the relevant minion job finishes.
The first download_asset task executes normally; I think what happens is we wind up in the subclassed execute
in lib/OpenQA/WebAPI/GruJob.pm , which ultimately calls the parent execute
and waits for it to return. That, I think, winds up returning when lib/OpenQA/Task/Asset/Download.pm _download
returns - for this first task, that returns when the download is complete, so that's fine. Then the subclassed execute
calls _delete_gru
and that deletes the gru task entry from GruTasks.
For the second task, however, we get a problem. The same basic flow happens: we wind up in the subclassed execute
, which calls the parent execute
, which runs the subroutine registered for the task, which runs _download
and returns when it returns. However...it returns immediately, because it just hits the return $job->retry({delay => 30})
block from my commit. At this point, our subclassed execute
calls _delete_gru
and deletes the task from GruTasks (and thus the dependency from GruDependencies) - even though the minion job itself has not executed yet, it still exists in 'inactive' state.
So now all the jobs for whichever flavor got its download_asset task scheduled second immediately execute and of course they all die, because the ISO hasn't downloaded yet.
Basically I think the problem here is the assumption that the parent execute
returning means the minion job actually finished - that assumption is no longer true, now we have this "use a guard and retry if it's already taken" behaviour in download_asset.
I suspect the same bug actually exists for other tasks where we use the same pattern, e.g. cache_asset
and cache_tests
- those tasks will also get deleted from the GruTasks database prematurely (before the minion job actually executes). The bug is likely much less visible in that case, though.
I'm really not sure what the best fix for this would be. Does anyone have any ideas? :) Thanks!
Updated by AdamWill about 5 years ago
I'm not...entirely sure if this would work:
--- a/lib/OpenQA/WebAPI/GruJob.pm
+++ b/lib/OpenQA/WebAPI/GruJob.pm
@@ -48,7 +48,7 @@ sub execute {
if $self->fail({(output => $buffer) x !!(defined $buffer), error => $err})
&& exists $self->info->{notes}{gru_id};
}
- else {
+ elsif ($self->is_finished) {
$self->finish(defined $buffer ? $buffer : 'Job successfully executed');
$self->_delete_gru($self->info->{notes}{gru_id}) if exists $self->info->{notes}{gru_id};
}
I think it might. I guess I'll try it and see...
Updated by AdamWill about 5 years ago
No, that seems to not work: even when the minion job is actually executed successfully, that conditional check fails for some reason and the gru task is not deleted...
Updated by AdamWill about 5 years ago
https://github.com/os-autoinst/openQA/pull/2008 does seem to fix this, though. Running in Fedora staging and nothing has blown up yet...
Updated by AdamWill about 5 years ago
Update again: kraih came up with a slightly different PR, with tests. Since he's the expert here I'll defer to him :) I'm going to test soon and confirm his PR fixes the Fedora case, I expect it to. That PR is https://github.com/os-autoinst/openQA/pull/2011
Updated by kraih about 5 years ago
- Status changed from In Progress to Resolved
The PR has been merged, so this should be resolved now.