action #64520
closed
Deal with jobs stuck in assigned state
Description
So far the job assignment is only revoked when a worker sends status updates claiming it is not working on a job although a job is assigned to it. This means that jobs are stuck in the assigned state if a worker slot goes unexpectedly offline, like in https://progress.opensuse.org/issues/64514.
It is also problematic that assigned jobs can be restarted: that creates a clone, but the original job then remains stuck in the assigned state (e.g. https://openqa.opensuse.org/tests/1203580).
Suggestions
- The stale job detection should also cover assigned jobs after a reasonable timeout (implemented within https://github.com/os-autoinst/openQA/pull/3389/commits/585a0a3ef4f84a789a3260b4f4fe9cef8311b469); see the sketch after this list. If those jobs are just set back to scheduled we might run into https://progress.opensuse.org/issues/62984 when the worker tries to run the job after all. The already proposed PR https://github.com/os-autoinst/openQA/pull/3409 takes care of that.
- Restarting assigned jobs should be improved. We could allow cancelling them, as is already possible for scheduled jobs. However, if the worker then starts the job after all we will run into https://progress.opensuse.org/issues/62984. So it makes more sense to keep the restart but ensure that the assigned job is set to cancelled (with the result "user cancelled").
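To make suggestion 1 concrete, here is a minimal sketch of a stale-job check that also covers assigned jobs. This is not openQA's actual (Perl) implementation; the data structures, field names and the timeout value are assumptions for illustration only.

# Minimal sketch of stale-job detection that also covers "assigned" jobs.
# Not openQA's actual implementation; all names and the timeout are assumptions.
from datetime import datetime, timedelta

STALE_TIMEOUT = timedelta(minutes=2)  # the "reasonable timeout" from above, value assumed

def check_stale_jobs(jobs, workers, now=None):
    """Set assigned jobs back to scheduled if their worker has not been seen recently."""
    now = now or datetime.utcnow()
    for job in jobs:
        if job["state"] != "assigned":
            continue
        worker = workers.get(job["assigned_worker_id"])
        if worker is None or now - worker["last_seen"] > STALE_TIMEOUT:
            # revoke the assignment so the scheduler can hand the job to another worker
            job["state"] = "scheduled"
            job["assigned_worker_id"] = None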
Updated by mkittler over 4 years ago
- Project changed from openQA Infrastructure (public) to openQA Project (public)
Updated by okurz over 4 years ago
- Category set to Regressions/Crashes
Similar: https://openqa.suse.de/tests/3998087 on osd has been stuck in the running state for 24 days at 100% progress, with the reason already filled in, i.e. the job is "running" but at the same time looks "finished". I did not change the job so that we can investigate it further.
Info from the job over the API:
$ openqa_client_osd --json-output jobs/3998087
{
   "job" : {
      "id" : 3998087,
      "state" : "running",
      "t_started" : "2020-03-16T19:54:27",
      "group" : "Functional",
      "reason" : "api failure: 400 response: Got status update for job 3998087 and worker 527 but there is not even a worker assigned to this job (job is scheduled)",
      "assets" : {
         "other" : [
            "SLE-15-SP2-Online-aarch64-Build160.1-Media1.iso.sha256"
         ]
      },
      "name" : "sle-15-SP2-Online-aarch64-Build160.1-default@aarch64",
      "test" : "default",
      "priority" : 50,
      "blocked_by_id" : null,
      "result" : "incomplete",
      "t_finished" : "2020-03-16T19:54:26",
      "group_id" : 110,
      "settings" : {
         "VERSION" : "15-SP2",
         …
         "BACKEND" : "qemu"
      },
      "parents" : {
         "Chained" : [],
         "Directly chained" : [],
         "Parallel" : []
      },
      "clone_id" : 4001121,
      "assigned_worker_id" : 527,
      "children" : {
         "Directly chained" : [],
         "Parallel" : [],
         "Chained" : []
      }
   }
}
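As a side note, such stuck jobs can also be spotted over the REST route behind the openqa_client call above. A rough sketch, assuming python-requests and that the /api/v1/jobs route accepts a state filter; the base URL and the one-day threshold are arbitrary choices:

from datetime import datetime, timedelta
import requests

OPENQA_BASE = "https://openqa.suse.de"  # instance to inspect, adjust as needed

# Fetch all jobs currently in the "running" state and print those that started
# suspiciously long ago (here: more than one day).
resp = requests.get(f"{OPENQA_BASE}/api/v1/jobs", params={"state": "running"})
resp.raise_for_status()
threshold = datetime.utcnow() - timedelta(days=1)
for job in resp.json()["jobs"]:
    started = job.get("t_started")
    if started and datetime.strptime(started, "%Y-%m-%dT%H:%M:%S") < threshold:
        print(job["id"], job["name"], "running since", started)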
Updated by mkittler over 4 years ago
Of course this job was actually set to running. The worker even sent a reason but somehow the web UI didn't set the state to done. That is very weird because I'm not aware of a way to set the reason without also setting the job state to done at the same time. Likely there's yet another race condition somewhere.
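Just to illustrate the expectation described above (this is not openQA's actual code, which is Perl and uses its own ORM): reason and final state would normally only ever change together, e.g. in a single transaction, so seeing one without the other hints at two independent writes racing each other.

import sqlite3

def finish_job(db: sqlite3.Connection, job_id: int, result: str, reason: str) -> None:
    # Toy illustration: state, result and reason change in one transaction, or not at all.
    with db:
        db.execute(
            "UPDATE jobs SET state = 'done', result = ?, reason = ? WHERE id = ?",
            (result, reason, job_id),
        )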
Updated by okurz about 4 years ago
- Status changed from New to Blocked
- Assignee set to okurz
Looks related to #69784, which we should probably wait for before re-evaluating.
Updated by mkittler about 4 years ago
- Status changed from Blocked to In Progress
- Assignee changed from okurz to mkittler
I'm re-evaluating this ticket myself:
- I'll have to re-check my concern "If those jobs are just set back to scheduled…" in suggestion 1 because setting those jobs back to scheduled has now been implemented. I supposed that if the worker comes back after all it would simply run into API errors which wouldn't be recorded (like in #62984) because the worker is no longer assigned. However, it looks like the API error will be recorded and it would even interfere when another worker has already taken that job.
- Suggestion 2 should still be implemented.
Updated by mkittler about 4 years ago
- Description updated (diff)
PR https://github.com/os-autoinst/openQA/pull/3409 will solve the points from my previous comment.
Updated by mkittler about 4 years ago
- Status changed from In Progress to Resolved
The PR has been merged. In order to test this locally with the full stack I modified the worker code to simulate a slow worker and started a 2nd worker at the right time. That is nothing I want to do in production, so I'll mark the ticket as resolved.