action #27454: [tools][scheduling] Worker's seen DB field is ignored by WebSocket server when checking for stale jobs - openQA Project (public) - openSUSE Project Management Tool

action #27454

closed

coordination #32851: [tools][EPIC] Scheduling redesign

[tools][scheduling] Worker's seen DB field is ignored by WebSocket server when checking for stale jobs

Added by EDiGiacinto over 7 years ago. Updated over 5 years ago.

Status:

Resolved

Priority:

Low

Assignee:

mkittler

Category:

Feature requests

Target version:

Done

Start date:

2018-05-05

Due date:

% Done:

Estimated time:

Description

Worker status is also updated via other routes (e.g. while updating the job status [1]), but the WebSocket server checks for stale jobs using a different field that is updated in the WebSocket server context [2] and later used to reap jobs that belong to inactive workers [3].
We should unify the way we check the worker's seen status, possibly using the DB as the reference. Otherwise jobs can be marked as incomplete when a blocking operation occurs on the worker side (e.g. during the cache setup phase, rsync calls, etc.).
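
One way to picture the proposed unification is a single staleness check that consults both timestamps. The following is only a sketch under stated assumptions: the function names, the `WORKER_TIMEOUT` value, and the idea of taking the most recent of the two timestamps are illustrative and not taken from the openQA code base.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical threshold; the real timeout is not stated in this ticket.
WORKER_TIMEOUT = timedelta(seconds=120)


def effective_last_seen(db_seen: Optional[datetime],
                        ws_seen: Optional[datetime]) -> Optional[datetime]:
    """Return the most recent of the two timestamps, or None if neither is set."""
    candidates = [t for t in (db_seen, ws_seen) if t is not None]
    return max(candidates) if candidates else None


def is_worker_stale(db_seen: Optional[datetime],
                    ws_seen: Optional[datetime],
                    now: datetime) -> bool:
    """A worker counts as stale only if *both* sources are older than the timeout."""
    last = effective_last_seen(db_seen, ws_seen)
    return last is None or now - last > WORKER_TIMEOUT
```

With such a check, a worker that recently touched the database through another route would not be treated as inactive just because its WebSocket-side timestamp is old.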

Related issues 1 (0 open — 1 closed)

Updated by EDiGiacinto over 7 years ago

Related to action #25970: Profile/Optimize _workers_checker in WebSockets server added

Updated by EDiGiacinto over 7 years ago

For completeness, we do check that field, but after: https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/WebSockets/Server.pm#L373

Updated by coolo over 7 years ago

Target version set to Ready

We stopped updating this field because writing it at sub-second intervals was causing a lot of DB noise.

Updated by EDiGiacinto about 7 years ago

Category set to 122
Parent task set to #32851

This is still related to scheduling (as some logic is split in the ws server)

Updated by szarate almost 7 years ago

Start date changed from 2017-11-07 to 2018-05-05

due to changes in a related task

Updated by okurz almost 6 years ago

Subject changed from [tools] Worker's seen DB field is ignored by WebSocket server when checking for stale jobs to [tools][scheduling] Worker's seen DB field is ignored by WebSocket server when checking for stale jobs
Category changed from 122 to Feature requests

Is this still valid? Sorry, I don't understand it myself.

Updated by mkittler over 5 years ago

Difficulty set to medium

Current state: The "last seen" timestamp of a worker is updated in the database when the worker updates the job status. It is also updated when the worker sends its status updates via web sockets. In addition, we track the "last seen" timestamp a second time in the web socket server. This second timestamp is not updated when the worker only uses the REST API, and it is the only timestamp used to mark stale jobs as incomplete.
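
To make the current state concrete, here is a minimal model of the two timestamps. It is a sketch, not the actual implementation: the class, method names, and the `STALE_AFTER` value are hypothetical, and only the update rules mirror what is described above.

```python
from datetime import datetime, timedelta

# Hypothetical threshold; the real value is not given in this ticket.
STALE_AFTER = timedelta(seconds=120)


class WorkerRecord:
    """Toy model of a worker with a DB timestamp and a web socket server timestamp."""

    def __init__(self, now: datetime) -> None:
        self.db_seen = now   # "last seen" stored in the database
        self.ws_seen = now   # second "last seen" tracked by the web socket server

    def on_rest_job_status_update(self, now: datetime) -> None:
        # REST route: only the database timestamp is refreshed.
        self.db_seen = now

    def on_websocket_status(self, now: datetime) -> None:
        # Web socket route: both timestamps are refreshed.
        self.db_seen = now
        self.ws_seen = now


def job_considered_stale(worker: WorkerRecord, now: datetime) -> bool:
    # Only the web socket server's timestamp is consulted.
    return now - worker.ws_seen > STALE_AFTER
```

In this model, a worker that reports only via the REST API for five minutes keeps a fresh `db_seen` but is still considered stale, because `ws_seen` never moves.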

Having the timestamp twice is redundant and a bit weird. However, since the database timestamp is not updated during the multi-chunk upload, taking it into account would not prevent jobs from being marked as incomplete when the worker is blocking/unresponsive. Updating the database timestamp during the upload might be quite expensive. So although having two timestamps is not nice, I don't see any benefit in refactoring this right now.

Improving the multi-chunk upload and other blocking things on the worker is much more beneficial to prevent the problem in the first place.

Note that we sometimes see jobs stuck in a perpetual "running" or "uploading" state. I'm afraid this refactoring wouldn't help there either, because in these cases the jobs are not marked as incomplete since the worker-job relation is (somehow) unset.

So while this "curiosity" in our code base still exists I don't see a big benefit in improving it.
