action #75073
closed
finalize_job_results minion task fails because 'Job xxx does not exist.'
Description
Observation
The Minion task finalize_job_results fails when the related job does not exist. For example, https://openqa.suse.de/minion/jobs?id=690535 shows:
{
  "args" => [
    4775400
  ],
  "attempts" => 1,
  "children" => [],
  "created" => "2020-10-05T08:37:52.31081Z",
  "delayed" => "2020-10-05T08:37:52.31081Z",
  "expires" => undef,
  "finished" => "2020-10-05T08:37:53.82253Z",
  "id" => 690535,
  "lax" => 0,
  "notes" => {
    "gru_id" => 27804961
  },
  "parents" => [],
  "priority" => 0,
  "queue" => "default",
  "result" => "Job 4775400 does not exist.",
  "retried" => undef,
  "retries" => 0,
  "started" => "2020-10-05T08:37:53.61713Z",
  "state" => "failed",
  "task" => "finalize_job_results",
  "time" => "2020-10-22T07:58:52.96067Z",
  "worker" => 357
}
Suggestion
When the job does not exist, the Minion task should exit successfully.
Updated by okurz almost 4 years ago
- Category set to Feature requests
- Status changed from New to Rejected
- Assignee set to okurz
It's good to have this ticket as reference in case we see this again. However, the linked minion job is already 18 days old and as long as this is not appearing more often and hence reproducible I don't think there is anything we should do.
Updated by okurz almost 4 years ago
- Status changed from Rejected to Workable
- Assignee deleted (okurz)
OK, maybe you deleted more recent jobs. I have seen another 3 sets today.
Updated by mkittler almost 4 years ago
- Assignee set to mkittler
I've seen 3 more occurrences today. Considering the number of jobs we saw on OSD yesterday, that's still a very small percentage of failures, of course.
Updated by mkittler almost 4 years ago
Interestingly, all of these jobs actually exist. When retrying the Minion jobs they succeed.
The relevant jobs were only scheduled for one second before they were cancelled again. The cancellation triggers the finalize job, and in these cases the finalize job was executed without further delay. Might it be that the cancellation is done within the same transaction that created the job? If this transaction had not been committed yet when the Minion job was executed, it would make sense that the Minion job doesn't "see" the openQA job yet.
By the way, this problem could theoretically also happen if a job really gets deleted after it has just been finished but before the Minion job had a chance to run. However, that seems rather unlikely to happen in practice.
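The transaction hypothesis above can be illustrated with a small, self-contained sketch. It uses SQLite from the Python standard library purely as a stand-in (openQA itself runs on a different database), and the table layout and function names are assumptions made up for this illustration. It only shows the general effect: a second connection does not see a row until the inserting transaction has been committed.

```python
import os
import sqlite3
import tempfile


def job_exists(conn, job_id):
    """Return True if a row for job_id is visible on this connection."""
    rows = conn.execute("SELECT 1 FROM jobs WHERE id = ?", (job_id,)).fetchall()
    return len(rows) > 0


def visibility_demo(job_id=4775400):
    """Return (seen_before_commit, seen_after_commit) for a second connection."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        writer = sqlite3.connect(path)  # plays the code creating/cancelling the job
        reader = sqlite3.connect(path)  # plays the worker running the finalize task
        try:
            # Hypothetical, simplified table; not the real openQA schema.
            writer.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, state TEXT)")
            writer.commit()

            # The INSERT implicitly opens a transaction that is not committed yet.
            writer.execute(
                "INSERT INTO jobs (id, state) VALUES (?, 'cancelled')", (job_id,)
            )
            seen_before = job_exists(reader, job_id)

            writer.commit()
            seen_after = job_exists(reader, job_id)
            return seen_before, seen_after
        finally:
            reader.close()
            writer.close()


if __name__ == "__main__":
    print(visibility_demo())
```

If the finalize task runs in the window between the insert and the commit, it would report the job as missing even though a retry shortly afterwards succeeds, which matches the behaviour described above.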
Updated by mkittler almost 4 years ago
- Status changed from Workable to In Progress
see PR description: https://github.com/os-autoinst/openQA/pull/3496
Updated by mkittler almost 4 years ago
- Status changed from In Progress to Resolved
Updated by okurz almost 4 years ago
- Status changed from Resolved to Feedback
@mkittler Personally, I would have preferred to keep this ticket open until the original issue is really resolved for us, i.e. no more Minion jobs failing on OSD for the described reason. However, as the author of the ticket is a fellow team member, I hope that we will be able to find the ticket again if the problem does not go away. Still, in general, tickets should visualize the work, and hence a ticket can only be called "Resolved" when there is no related work left for us to do. Right now on https://openqa.suse.de/minion/jobs?state=failed I still see "finalize_job_results" failing, and this will continue until we have rolled out the corresponding change. OSD has not been updated since 2020-10-14 (!), so IMHO we can't call this done. OK? :)
Updated by mkittler almost 4 years ago
I see only one failed job from 4 hours ago. Likely the job you've mentioned has already been deleted.
The job I see now failed with "Job 4967343 does not exist.", but in this case the job really doesn't exist. It likely existed at some point but has since been deleted.
- We could simply treat this case as a success. However, if we ever really passed invalid job IDs, we would silently ignore that mistake. It would also not solve errors when the openQA job is deleted during the execution of the Minion job.
- We could also abort any related Minion jobs before deleting an openQA job. The deletion of an openQA job (triggered via API or from the cleanup code) is already a little bit involved anyway (as we delete the file on the file system and delete screenshot links), so dealing with Minion jobs there shouldn't make it much worse. Of course, a Minion job can have multiple related openQA jobs, so we should only stop it if no other openQA job is related.
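To make the first option concrete, here is a minimal sketch in Python. It is not the openQA implementation; the `TaskResult` type, the in-memory `jobs` dictionary and the function signature are all hypothetical stand-ins. It only shows the idea of finishing the task successfully with an explanatory note when the job cannot be found, which is also exactly where the drawback mentioned above comes from: a genuinely wrong job ID ends up as a success too.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical stand-in for the outcome recorded for a background task."""
    state: str  # "finished" or "failed"
    note: str


def finalize_job_results(job_id, jobs):
    """Finalize a job; treat a missing job as success instead of failure.

    `jobs` is a hypothetical in-memory store mapping job IDs to dicts.
    """
    job = jobs.get(job_id)
    if job is None:
        # Option 1: nothing to finalize, so finish successfully but keep the
        # reason visible. An invalid job ID is silently accepted here as well.
        return TaskResult("finished", f"Job {job_id} does not exist.")
    job["finalized"] = True
    return TaskResult("finished", f"Job {job_id} finalized.")


if __name__ == "__main__":
    store = {1: {"finalized": False}}
    print(finalize_job_results(4967343, store))
    print(finalize_job_results(1, store))
```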
Updated by okurz almost 4 years ago
the second approach sounds like a clean solution to me :)
EDIT: Oh, but actually, if you think the first approach is much easier, then I would just go with that and wait to see whether ignoring the mistake of invalid job IDs ever becomes a problem :)
Updated by mkittler almost 4 years ago
PR for option 1.: https://github.com/os-autoinst/openQA/pull/3550
The problem with option 2. is that the Gru tables don't actually allow finding the related Minion jobs of an openQA job. They only allow finding the Gru job, but as far as I can see there's no relation from that to the underlying Minion job. I would need to extend the Gru code for that, but actually it would make more sense to phase it out completely, turning GruDependencies into MinionDependencies (see #77704). So for now I would just go for option 1.. Of course we'll still see failing jobs when the deletion happens while the finalize job is executed, but that's rather unlikely. (It is generally unlikely to see failed finalize_job_results jobs and I guess the low probabilities multiply.) Merging the PR for option 1. also means we suppress problems within the creation of finalize jobs (when a wrong job ID is passed).
Note that option 2. would also involve having to stop a possibly running Minion job. That could be done via a broadcast.
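For reference, the selection rule that option 2. would need (only stop a Minion job if no other openQA job is related to it) can be sketched as follows. The `relations` mapping is purely hypothetical; as noted above, the Gru tables currently do not provide such a relation, so this is only an illustration of the rule and not something that could be applied as-is.

```python
def minion_jobs_to_abort(deleted_job_id, relations):
    """Return the Minion job IDs that relate only to the deleted openQA job.

    `relations` is a hypothetical mapping: Minion job ID -> set of related
    openQA job IDs. Minion jobs shared with other openQA jobs are kept.
    """
    return sorted(
        minion_id
        for minion_id, job_ids in relations.items()
        if job_ids == {deleted_job_id}
    )


if __name__ == "__main__":
    relations = {
        1: {10},      # only related to job 10: safe to abort
        2: {10, 11},  # also related to job 11: must keep running
        3: {12},      # unrelated to job 10
    }
    print(minion_jobs_to_abort(10, relations))
```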
Updated by mkittler almost 4 years ago
- Related to action #77704: Phase out Gru tables and other no longer needed Gru abstractions added
Updated by okurz almost 4 years ago
In case your PR does not fix all the cases we see we can still exclude the specific problem situation from our monitoring and alerting.
Updated by mkittler almost 4 years ago
- Status changed from Feedback to Resolved
Like I said, my PR leaves the case when the openQA job is deleted while the related finalize Minion job is executed. However, I don't know whether that happens in production (often enough to care about it). We haven't even seen any such failures on o3 or OSD since the one from 9 days ago, and my PR has only been deployed recently.
So I'm marking the ticket as resolved and we can still re-open it if it happens again.