action #25290

closed

Why do we now have two separate mechanisms by which workers report status to the server?

Added by AdamWill over 6 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Feature requests
Target version:
Start date: 2017-09-14
Due date:
% Done: 0%
Estimated time:

Description

This has been bugging me at a low level for a few days, but I only just realized exactly what it is that's bugging me...

So mudler recently implemented a whole code path by which workers report their status to the server via websockets. The implementation has moved around a bit, but basically call_websocket in lib/OpenQA/Worker/Common.pm sets up a timer called 'workerstatus-$host' that sends a status update to the server over the websocket connection every 15 seconds.

Only...we already have a mechanism by which workers send regular status updates to the server, and that one uses API POSTs, not websockets. It goes through the update_status API sub (in lib/OpenQA/WebAPI/Controller/API/V1/Job.pm), and the worker side is implemented in lib/OpenQA/Worker/Jobs.pm and lib/OpenQA/Worker/Commands.pm: basically they set up a timer called 'update_status' at the start of each job, which calls a sub also called 'update_status', which in turn calls another sub called 'upload_status', which does the actual work (ultimately sending a POST to the /jobs/(jobid)/status API endpoint).
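To make the overlap concrete, here is a rough sketch of the two reporting paths running side by side. It is a Python simulation, not the actual Perl worker code: the Worker and Timer classes, the simulated clock, and the POST interval (which I haven't stated above and which is just a parameter here) are all assumptions for illustration. Only the timer names, the sub names, the 15 second websocket interval and the endpoint path come from the description above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Timer:
    interval: int                      # seconds between firings
    callback: Callable[[int], None]    # called with the current simulated time


@dataclass
class Worker:
    host: str
    job_id: int
    # One entry per report sent: (time, transport, destination).
    sent: List[Tuple[int, str, str]] = field(default_factory=list)
    timers: Dict[str, Timer] = field(default_factory=dict)

    def call_websocket(self, interval: int = 15) -> None:
        # Mechanism 1: timer 'workerstatus-$host', reporting over websockets.
        name = f"workerstatus-{self.host}"
        self.timers[name] = Timer(
            interval, lambda now: self.sent.append((now, "websocket", "server"))
        )

    def start_job(self, interval: int) -> None:
        # Mechanism 2: timer 'update_status', set up at the start of each job.
        # The interval is a hypothetical parameter, not a value from the code.
        self.timers["update_status"] = Timer(interval, self.update_status)

    def update_status(self, now: int) -> None:
        # The timer callback only delegates to upload_status.
        self.upload_status(now)

    def upload_status(self, now: int) -> None:
        # Does the actual work: a POST to the job status endpoint.
        self.sent.append((now, "POST", f"/jobs/{self.job_id}/status"))

    def run(self, seconds: int) -> None:
        # Simulated clock: fire every timer whose interval divides the time.
        for now in range(1, seconds + 1):
            for timer in self.timers.values():
                if now % timer.interval == 0:
                    timer.callback(now)


worker = Worker(host="host1", job_id=42)
worker.call_websocket()          # every 15 simulated seconds
worker.start_job(interval=10)    # assumed interval, for illustration only
worker.run(30)
for report in worker.sent:
    print(report)
```

In this toy run the same worker ends up with two unrelated timers, each pushing status to the server on its own schedule over a different transport, which is the duplication I'm asking about.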

I am not 100% clear about what actually depends on the POSTed status updates any more. I believe the 'dead worker detection' code in lib/OpenQA/WebSockets/Server.pm used to depend on them, but it seems that since 246efe2b6, that code uses the last time the worker checked in via websockets as its 'last seen' time, rather than the last time it POSTed a status update.
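As a sketch of why it matters which timestamp the detection looks at, the following Python snippet compares the two choices. It is not the code from lib/OpenQA/WebSockets/Server.pm: WorkerRecord, is_dead, the field names and the timeout value are all hypothetical, chosen only to illustrate the difference.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkerRecord:
    # Hypothetical fields: when the worker last checked in over websockets,
    # and when it last POSTed a job status update.
    last_websocket_seen: Optional[float] = None
    last_status_post: Optional[float] = None


def is_dead(record: WorkerRecord, now: float, timeout: float,
            use_websocket_time: bool = True) -> bool:
    """Return True if the chosen 'last seen' time is missing or too old."""
    if use_websocket_time:
        seen = record.last_websocket_seen
    else:
        seen = record.last_status_post
    return seen is None or now - seen > timeout


# A worker that still checks in over websockets but stopped POSTing status.
record = WorkerRecord(last_websocket_seen=100.0, last_status_post=10.0)
print(is_dead(record, now=110.0, timeout=30.0, use_websocket_time=True))
print(is_dead(record, now=110.0, timeout=30.0, use_websocket_time=False))
```

With the websocket timestamp this worker counts as alive; with the POST timestamp the same worker would count as dead. If nothing else reads the POSTed timestamps for this purpose any more, that would be one less reason to keep both mechanisms.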

This doesn't seem like good design in general. Can we reconcile these mechanisms somehow? If that work is already in progress, could the current state be explained somewhere (e.g. in code comments) so it's less confusing?
