action #48554
closedGru/minion task deleted from GruTasks db on retry
0%
Description
So, after the recent change to the gru/minion stuff to run tasks in parallel, I sent this commit:
https://github.com/os-autoinst/openQA/commit/444e686fb78c75b96395b199f70d8efdd75288f2
to try and ensure we do not have multiple tasks trying to download the same asset at the same time.
However, after running that in staging for a bit, I've found there's a problem with it.
There's a pattern for Fedora compose tests where we post the same ISO as two different flavors; this is because we have a 'universal' flavor which can technically be run on several different ISOs, and we run it on the best candidate that's present in the compose. Usually the best candidate is the Server DVD ISO, so on a typical compose, we will post the same ISO as 'server-dvd-iso' and 'universal' flavors.
This creates two download_asset tasks for the same ISO, just the case this code guards against. Both tasks are correctly created and sent for execution as minion jobs, and both are also registered in the GruTasks database table. At this point, all the jobs for both flavors should be blocked for execution until the relevant minion job finishes.
The first download_asset task executes normally; I think what happens is we wind up in the subclassed execute
in lib/OpenQA/WebAPI/GruJob.pm , which ultimately calls the parent execute
and waits for it to return. That, I think, winds up returning when lib/OpenQA/Task/Asset/Download.pm _download
returns - for this first task, that returns when the download is complete, so that's fine. Then the subclassed execute
calls _delete_gru
and that deletes the gru task entry from GruTasks.
For the second task, however, we get a problem. The same basic flow happens: we wind up in the subclassed execute
, which calls the parent execute
, which runs the subroutine registered for the task, which runs _download
and returns when it returns. However...it returns immediately, because it just hits the return $job->retry({delay => 30})
block from my commit. At this point, our subclassed execute
calls _delete_gru
and deletes the task from GruTasks (and thus the dependency from GruDependencies) - even though the minion job itself has not executed yet, it still exists in 'inactive' state.
So now all the jobs for whichever flavor got its download_asset task scheduled second immediately execute and of course they all die, because the ISO hasn't downloaded yet.
Basically I think the problem here is the assumption that the parent execute
returning means the minion job actually finished - that assumption is no longer true, now we have this "use a guard and retry if it's already taken" behaviour in download_asset.
I suspect the same bug actually exists for other tasks where we use the same pattern, e.g. cache_asset
and cache_tests
- those tasks will also get deleted from the GruTasks database prematurely (before the minion job actually executes). The bug is likely much less visible in that case, though.
I'm really not sure what the best fix for this would be. Does anyone have any ideas? :) Thanks!