action #32851: [tools][EPIC] Scheduling redesign
[tools][scheduling] Worker's seen DB field is ignored by WebSocket server when checking for stale jobs
The worker's status is also updated via different routes (e.g. while updating the job status), but in the WebSocket server we check for stale jobs using another field, which is updated in the WebSocket server context and used later to reap jobs that belong to inactive workers.
We should unify the way we check the worker's seen status, possibly using the DB as a reference; otherwise jobs could be marked as incomplete if a blocking operation occurs on the worker side (e.g. during the cache setup phase, rsync calls, etc.).
#2 Updated by EDiGiacinto over 2 years ago
For completeness, we do check that field, but after: https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/WebSockets/Server.pm#L373
- Subject changed from [tools] Worker's seen DB field is ignored by WebSocket server when checking for stale jobs to [tools][scheduling] Worker's seen DB field is ignored by WebSocket server when checking for stale jobs
- Category changed from 122 to Feature requests
Is this still valid? Sorry, I don't understand it myself.
- Difficulty set to medium
Current state: The "last seen" timestamp of a worker is updated in the database when the worker updates the job status. It is also updated when the worker sends its status updates via web sockets. In addition, we track the "last seen" timestamp a second time in the web socket server. This second timestamp is not updated when the worker "just" uses the REST API, and only that timestamp is used to mark stale jobs as incomplete.
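To make the current state easier to follow, here is a minimal Python sketch of the two timestamps. It is a model of the behaviour described above, not the actual openQA code: the `Tracker` class, its method names and the `STALE_AFTER` threshold are all made up for illustration.

```python
from dataclasses import dataclass, field

STALE_AFTER = 120  # seconds; hypothetical threshold, not openQA's real value


@dataclass
class Tracker:
    # worker id -> "last seen" timestamp stored in the database
    db_seen: dict = field(default_factory=dict)
    # worker id -> "last seen" timestamp kept by the web socket server
    ws_seen: dict = field(default_factory=dict)

    def job_status_update(self, worker, now):
        # REST API route: only the database timestamp is touched.
        self.db_seen[worker] = now

    def ws_status_update(self, worker, now):
        # Web socket route: both timestamps are touched.
        self.db_seen[worker] = now
        self.ws_seen[worker] = now

    def stale_workers(self, now):
        # Only the web socket server's own timestamp is consulted.
        return sorted(w for w, t in self.ws_seen.items() if now - t > STALE_AFTER)


tracker = Tracker()
tracker.ws_status_update("worker1", now=0)
tracker.job_status_update("worker1", now=100)  # REST API only
stale = tracker.stale_workers(now=150)
```

In this model `worker1` ends up in `stale` although its database timestamp is only 50 seconds old, because the REST update never reached the second timestamp.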
Having the timestamp twice is a bit redundant and weird. However, since the database timestamp is not updated during the multi-chunk upload, taking it into account would not prevent jobs from being marked incomplete when the worker is blocking/unresponsive. Updating the database timestamp during the upload might be quite expensive. So although having two timestamps is not nice, I don't see any benefit in refactoring this right now.
Improving the multi-chunk upload and other blocking things on the worker is much more beneficial to prevent the problem in the first place.
Note that we sometimes see jobs in a perpetual "running" or "uploading" state. I'm afraid this refactoring wouldn't help there either, because in these cases the jobs are not marked as incomplete since the worker-job relation is (somehow) unset.
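A rough Python sketch of why such jobs escape the reaper, assuming the reaper finds jobs only through their assigned worker. The function name, the dictionary layout and the state/result values are simplified assumptions for illustration, not the real schema or code.

```python
def reap_stale_jobs(jobs, stale_workers):
    """Mark unfinished jobs of stale workers as incomplete (hypothetical reaper)."""
    reaped = []
    for job in jobs:
        worker = job.get("worker")  # None models the unset worker-job relation
        if worker is None or worker not in stale_workers:
            continue  # no stale worker points at this job, so it is never touched
        if job["state"] in ("running", "uploading"):
            job["state"] = "done"
            job["result"] = "incomplete"
            reaped.append(job["id"])
    return reaped


jobs = [
    {"id": 1, "state": "running", "worker": "worker1"},
    {"id": 2, "state": "uploading", "worker": None},  # relation lost
]
reaped = reap_stale_jobs(jobs, stale_workers={"worker1"})
```

Under these assumptions job 2 stays in "uploading" no matter which timestamp decides whether a worker is stale, which is why unifying the timestamps would not help with those jobs.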
So while this "curiosity" in our code base still exists I don't see a big benefit in improving it.