action #25290

Why do we now have two separate mechanisms by which workers report status to the server?

Added by AdamWill over 2 years ago. Updated 4 months ago.

Status: Resolved
Start date: 14/09/2017
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: Feature requests
Target version: Done
Difficulty:
Duration:

Description

This has been bugging me at a low level for a few days, but I only just realized exactly what it is that's bugging me...

So mudler recently implemented a whole code path by which workers report their status to the server via websockets. The implementation has pinged around a bit, but basically call_websocket in lib/OpenQA/Worker/Common.pm sets up a timer called 'workerstatus-$host' that sends a status update to the server via websockets every 15 seconds.
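For illustration, the recurring-timer mechanism described above can be sketched roughly like this (in Python rather than the project's Perl; the function names, message shape, and use of threading.Timer are assumptions for the sketch, not the actual openQA payload or event loop):

```python
import json
import threading

STATUS_INTERVAL = 15  # seconds, per the description above


def make_status_message(worker_id, status):
    # Build a periodic status message a worker might push over websockets.
    # The exact fields here are hypothetical, not the real openQA payload.
    return json.dumps({"type": "worker_status",
                       "worker_id": worker_id,
                       "status": status})


def start_status_timer(worker_id, send, interval=STATUS_INTERVAL):
    # Register a repeating per-host status timer: send one update now,
    # then reschedule the same callback every `interval` seconds.
    def tick():
        send(make_status_message(worker_id, "working"))
        timer = threading.Timer(interval, tick)
        timer.daemon = True  # don't keep the process alive for the timer
        timer.start()
    tick()
```

The key design point is that this channel fires unconditionally on a timer, independent of any running job, which is what makes it redundant with the per-job POST mechanism described below.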

Only... we already have a mechanism by which workers send regular status updates to the server, except that one uses API POSTs, not websockets. It uses the update_status API sub (in lib/OpenQA/WebAPI/Controller/API/V1/Job.pm) and is implemented in lib/OpenQA/Worker/Jobs.pm and lib/OpenQA/Worker/Commands.pm: at the start of each job they set up a timer called 'update_status', which calls a sub also called 'update_status', which in turn calls another sub called 'upload_status' that does the actual work (ultimately sending a POST to the /jobs/(jobid)/status API endpoint).
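The timer → update_status → upload_status → POST chain can be sketched like this (again Python, with a transport injected as a callable so the sketch needs no HTTP client; the function names mirror the Perl subs above, but the payload is an assumption):

```python
def upload_status(job_id, status, post):
    # Do the actual work: POST the current status to the job-status
    # endpoint named in the ticket. `post` is an injected callable
    # standing in for the worker's HTTP client.
    return post(f"/jobs/{job_id}/status", {"status": status})


def update_status(job_id, collect_status, post):
    # Timer callback: gather the worker's current view of the job,
    # then hand it off to upload_status.
    return upload_status(job_id, collect_status(), post)
```

Unlike the websocket timer, this chain only exists while a job is running, since the timer is created at job start.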

I am not 100% clear on what actually depends on the POSTed status updates any more. I believe the 'dead worker detection' code in lib/OpenQA/WebSockets/Server.pm used to depend on them, but it seems that since 246efe2b6 that code uses the last time the worker checked in via websockets as its 'seen' time, rather than the last time it POSTed a status update.

This doesn't seem like good design in general. Can we reconcile these mechanisms somehow? If a reconciliation is already in progress, could it at least be explained (e.g. in code comments) so this is less confusing?

History

#1 Updated by AdamWill over 2 years ago

One practical consequence of the POST mechanism is that the server runs various checks before deciding whether to accept those status updates - including checking whether it thinks the job mentioned actually belongs to the worker sending the message - and refuses them (sends back a 400 error) if any of the checks fail. When the worker sends an API call it expects an 'ok' response; if it doesn't get one, it tries twice more (five seconds apart), and if none of the tries gets an 'ok' response, it aborts the job, un-registers, and then re-registers itself with the server.
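The retry-then-give-up behavior described above amounts to roughly the following (a sketch, not the real worker code; the call and give-up handlers are injected, and the sleep is injectable so the logic is testable without waiting):

```python
import time

MAX_TRIES = 3     # one initial attempt plus two retries, per the comment
RETRY_DELAY = 5   # seconds between attempts


def send_with_retry(call, on_give_up, sleep=time.sleep):
    # Send an API call expecting an 'ok' response. Retry twice more,
    # five seconds apart; if all three attempts fail, invoke the
    # give-up handler (abort the job, un-register, re-register).
    for attempt in range(MAX_TRIES):
        if call() == "ok":
            return True
        if attempt < MAX_TRIES - 1:
            sleep(RETRY_DELAY)
    on_give_up()
    return False
```

Note how aggressive the failure path is: a few rejected POSTs (e.g. the 400 from the server-side ownership checks) are enough to make the worker abandon the job entirely.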

#2 Updated by okurz 8 months ago

  • Category changed from 132 to Feature requests

#3 Updated by okurz 5 months ago

@AdamWill I am not sure about the details, but it could be that this was already consolidated and cleaned up a bit. Do you think what you originally reported is still the case, or should we close the ticket?

#4 Updated by mkittler 4 months ago

  • Status changed from New to Resolved
  • Target version set to Done

I have already removed the web socket route again and @kraih recently did the last bit of cleanup (https://github.com/os-autoinst/openQA/pull/2384). So this can be closed.
