action #64520

Updated by mkittler over 3 years ago

So far, the job assignment is only revoked when a worker sends a status update claiming it is not working on a job (although a job is assigned to it). This means that jobs are stuck in the assigned state if a worker slot unexpectedly goes offline, as in https://progress.opensuse.org/issues/64514. 

 It is also problematic that assigned jobs can be restarted. Restarting creates a clone, but the original job then remains stuck in the assigned state (e.g. https://openqa.opensuse.org/tests/1203580). 

 ### suggestions 

 1. The stale job detection should also cover assigned jobs after a reasonable timeout (implemented within https://github.com/os-autoinst/openQA/pull/3389/commits/585a0a3ef4f84a789a3260b4f4fe9cef8311b469). If those jobs are just set back to scheduled, we might run into https://progress.opensuse.org/issues/62984 when the worker tries to run the job after all (very unlikely but possible). The already proposed PR https://github.com/os-autoinst/openQA/pull/3409 takes care of that. Alternatively, the jobs could be marked as incomplete and cloned. 
 2. Restarting assigned jobs should be improved. We could allow cancelling them, as is possible for scheduled jobs. However, if the worker then starts the job after all, we will run into https://progress.opensuse.org/issues/62984. So it makes more sense to keep the restart but ensure that the assigned job is set to cancelled (with the result user cancelled).
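
The two suggestions can be illustrated with a small, self-contained sketch. It is not openQA code (openQA is written in Perl); the `Job` class, the `ASSIGNED_TIMEOUT` value, the function names and the state/result strings are all hypothetical and only mirror the wording used in this ticket.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Hypothetical timeout; the ticket only asks for "a reasonable timeout".
ASSIGNED_TIMEOUT = timedelta(minutes=5)


@dataclass
class Job:
    # Simplified stand-in for a job record; all field names are assumptions.
    id: int
    state: str                      # e.g. "scheduled", "assigned", "running", "done", "cancelled"
    result: str = "none"
    assigned_at: Optional[datetime] = None
    clone_id: Optional[int] = None


def reap_stale_assigned(jobs: List[Job], now: datetime, next_id: int) -> List[Job]:
    """Suggestion 1: mark jobs stuck in 'assigned' as incomplete and clone them."""
    clones = []
    for job in jobs:
        if job.state != "assigned" or job.assigned_at is None:
            continue
        if now - job.assigned_at < ASSIGNED_TIMEOUT:
            continue  # still within the grace period
        # Do not set the job back to "scheduled": the worker might still start it.
        job.state, job.result = "done", "incomplete"
        job.clone_id = next_id
        clones.append(Job(id=next_id, state="scheduled"))
        next_id += 1
    return clones


def restart_assigned(job: Job, next_id: int) -> Job:
    """Suggestion 2: keep the restart, but cancel the original assigned job."""
    if job.state != "assigned":
        raise ValueError("this sketch only handles assigned jobs")
    job.state, job.result = "cancelled", "user_cancelled"
    job.clone_id = next_id
    return Job(id=next_id, state="scheduled")
```

In both cases the original job ends up in a final state and only the clone is scheduled, so a worker that belatedly tries to start the original job can be rejected instead of running a job that was already handed out again.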
