action #64520
Deal with jobs stuck in assigned state
Status: closed
Start date:
2020-03-16
Due date:
% Done:
0%
Estimated time:
Tags:
Description
So far, a job assignment is only revoked when the worker sends a status update claiming it is not working on any job even though one is assigned to it. This means that jobs remain stuck in the assigned state if a worker slot unexpectedly goes offline, as in https://progress.opensuse.org/issues/64514.
It is also problematic that assigned jobs can be restarted. Restarting creates a clone, but the original job then remains stuck in the assigned state (e.g. https://openqa.opensuse.org/tests/1203580).
suggestions
- The stale job detection should also cover assigned jobs after a reasonable timeout (implemented within https://github.com/os-autoinst/openQA/pull/3389/commits/585a0a3ef4f84a789a3260b4f4fe9cef8311b469). If those jobs are simply set back to scheduled, we might run into https://progress.opensuse.org/issues/62984 when the worker tries to run the job after all. The already proposed PR https://github.com/os-autoinst/openQA/pull/3409 takes care of that.
- Restarting assigned jobs should be improved. We could allow cancelling them, as is already possible for scheduled jobs. However, if the worker then starts the job after all, we will run into https://progress.opensuse.org/issues/62984. So it makes more sense to keep the restart but ensure that the original assigned job is set to cancelled (with the result user cancelled).
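The two suggestions above could be sketched roughly as follows. This is a simplified Python model for illustration only, not openQA code: the `Job` class, its field names, the function names and the five-minute `ASSIGNED_TIMEOUT` are all assumptions made up for this sketch. It also does not handle the case where the worker starts the job after all (https://progress.opensuse.org/issues/62984).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Assumed value standing in for "a reasonable timeout"; not taken from openQA.
ASSIGNED_TIMEOUT = timedelta(minutes=5)


@dataclass
class Job:
    """Hypothetical, heavily simplified job record."""
    id: int
    state: str = "scheduled"  # e.g. scheduled, assigned, running, cancelled
    result: Optional[str] = None
    assigned_worker: Optional[int] = None
    assigned_at: Optional[datetime] = None
    clone_id: Optional[int] = None


def reschedule_stale_assigned_jobs(jobs: List[Job], now: datetime) -> List[Job]:
    """Set jobs back to scheduled if they stayed assigned longer than the timeout."""
    stale = []
    for job in jobs:
        if job.state != "assigned" or job.assigned_at is None:
            continue
        if now - job.assigned_at > ASSIGNED_TIMEOUT:
            # Revoke the assignment so the scheduler can pick the job up again.
            job.state = "scheduled"
            job.assigned_worker = None
            job.assigned_at = None
            stale.append(job)
    return stale


def restart_assigned_job(job: Job, new_id: int) -> Job:
    """Clone an assigned job and cancel the original so it does not stay assigned."""
    if job.state != "assigned":
        raise ValueError("job is not in the assigned state")
    clone = Job(id=new_id)  # the clone starts out as scheduled
    job.clone_id = clone.id
    job.state = "cancelled"
    job.result = "user cancelled"
    return clone


if __name__ == "__main__":
    now = datetime(2020, 3, 16, 12, 0)
    jobs = [
        Job(id=1, state="assigned", assigned_worker=7,
            assigned_at=now - timedelta(minutes=10)),
        Job(id=2, state="assigned", assigned_worker=8,
            assigned_at=now - timedelta(minutes=1)),
    ]
    # Job 1 exceeded the timeout and is rescheduled; job 2 is left alone.
    print([j.id for j in reschedule_stale_assigned_jobs(jobs, now)])
    # Restarting job 2 yields a scheduled clone and cancels the original.
    clone = restart_assigned_job(jobs[1], new_id=3)
    print(clone.state, jobs[1].state, jobs[1].result)
```

Keeping the restart and cancelling the original means no job is left in the assigned state without anyone acting on it, while the stale detection covers worker slots that silently disappear.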