Why do we now have two separate mechanisms by which workers report status to the server?
This has been bugging me at a low level for a few days, but I only just realized exactly what it is that's bugging me...
So mudler recently implemented a whole code path by which workers report their status to the server via websockets. The implementation has moved around a bit, but basically call_websocket in lib/OpenQA/Worker/Common.pm sets up a timer called 'workerstatus-$host' that sends a status update to the server via websockets every 15 seconds.
Only...we already have a mechanism by which workers send regular status updates to the server, and that one uses API POSTs, not websockets. It uses the update_status API sub (in lib/OpenQA/WebAPI/Controller/API/V1/Job.pm), and the worker side is implemented in lib/OpenQA/Worker/Jobs.pm and lib/OpenQA/Worker/Commands.pm. Basically, they set up a timer called 'update_status' at the start of each job. The timer calls a sub also called 'update_status', which calls another sub called 'upload_status', which does the actual work (ultimately sending a POST to the /jobs/(jobid)/status API endpoint).
I am not 100% clear about what actually depends on the POSTed status updates any more. I believe the 'dead worker detection' code in lib/OpenQA/WebSockets/Server.pm used to depend on them, but it seems that since 246efe2b6, that code uses the last time the worker checked in via websockets as its last 'seen' time, rather than the last time it POSTed a status update.
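To make the concern concrete, here is a rough sketch (in Python, not the project's Perl) of what I understand the dead worker check to do now. Everything here is hypothetical: the `Worker` class, the field names and the 40 second timeout are made up for illustration and are not taken from lib/OpenQA/WebSockets/Server.pm. The only point is that the POST timestamp is still being recorded but no longer feeds the decision.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    # Hypothetical record; these field names are not openQA's.
    last_ws_seen: float    # last websocket check-in (seconds)
    last_post_seen: float  # last POSTed status update (seconds)


def is_dead(worker: Worker, now: float, timeout: float = 40.0) -> bool:
    """Return True if the worker should be treated as dead.

    Only the websocket check-in time is consulted; last_post_seen is
    deliberately ignored. The timeout value is an assumption.
    """
    return now - worker.last_ws_seen > timeout


# A worker that still POSTs regularly but has gone quiet on websockets
# is considered dead under this scheme.
stale_ws = Worker(last_ws_seen=0.0, last_post_seen=95.0)
fresh_ws = Worker(last_ws_seen=90.0, last_post_seen=0.0)
```

With `now = 100.0`, `stale_ws` is reported dead even though it POSTed five seconds ago, and `fresh_ws` is reported alive even though it has not POSTed for 100 seconds.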
This doesn't seem like good design in general. Can we reconcile these mechanisms somehow? If it's in progress, could this maybe be explained somehow (code comments) to be less confusing?
#1 Updated by AdamWill over 2 years ago
One practical consequence of the POST mechanism is that the server runs various checks before deciding whether to accept those status updates, including checking whether it thinks the job mentioned belongs to the worker that's sending the message. If any of the checks fail, it refuses the update (sends back a 400 error). When the worker sends an API call, it expects an 'ok' response. If it doesn't get one, it tries twice more (every five seconds), and if it doesn't get an 'ok' response on any of the tries, it aborts the job, un-registers, then re-registers itself with the server.
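The behaviour described above can be sketched roughly like this. It is a Python illustration, not the actual Perl: `server_accepts` and `post_status_with_retries` are hypothetical names, the ownership check stands in for the "various checks" the server really does, and the sleep function is passed in so the example runs without actually waiting.

```python
from typing import Callable, Dict


def server_accepts(job_owner: Dict[int, int], worker_id: int, job_id: int) -> int:
    """Stand-in for the server-side check: return 200 if the server thinks
    job_id belongs to worker_id, otherwise 400."""
    return 200 if job_owner.get(job_id) == worker_id else 400


def post_status_with_retries(
    send: Callable[[], int],
    sleep: Callable[[float], None],
    attempts: int = 3,   # one initial try plus two retries
    delay: float = 5.0,  # seconds between tries
) -> bool:
    """Call send() up to `attempts` times, sleeping `delay` between tries.

    Returns True on the first 200 response. Returns False if every try
    fails, at which point the worker would abort the job and re-register.
    """
    for attempt in range(attempts):
        if send() == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# The server believes job 42 belongs to worker 1.
owners = {42: 1}
waits: list = []
ok = post_status_with_retries(lambda: server_accepts(owners, 2, 42), waits.append)
```

Here worker 2 claims job 42, so all three tries get a 400, `ok` ends up False and `waits` records the two five-second pauses. That is the path on which the worker gives up on the job, and it only exists for the POST mechanism.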