action #27454

action #32851: [tools][EPIC] Scheduling redesign

[tools][scheduling] Worker's seen DB field is ignored by WebSocket server when checking for stale jobs

Added by EDiGiacinto over 2 years ago. Updated 6 months ago.

Status: Resolved
Start date: 05/05/2018
Priority: Low
Due date:
Assignee: mkittler
% Done:
Category: Feature requests
Target version: Done


Worker status is also updated via different routes (e.g. while updating the job status [1]), but in the WebSocket server we check for stale jobs using another field that is updated in the WebSocket server context [2] and used later to reap jobs that belong to inactive workers [3].
We should unify the way we check the worker's seen status, possibly using the DB as a reference. Otherwise jobs could be marked as incomplete if a blocking operation occurs on the worker side (e.g. during the cache setup phase, rsync calls, etc.).


Related issues

Related to openQA Project - action #25970: Profile/Optimize _workers_checker in WebSockets server (Resolved, 11/10/2017)


#1 Updated by EDiGiacinto over 2 years ago

  • Related to action #25970: Profile/Optimize _workers_checker in WebSockets server added

#3 Updated by coolo over 2 years ago

  • Target version set to Ready

We stopped updating this field because updating it several times per second was causing a lot of DB noise.

#4 Updated by EDiGiacinto about 2 years ago

  • Category set to 122
  • Parent task set to #32851

This is still related to scheduling (as some logic is split in the ws server)

#5 Updated by szarate almost 2 years ago

  • Start date changed from 07/11/2017 to 05/05/2018

due to changes in a related task

#6 Updated by okurz 10 months ago

  • Subject changed from [tools] Worker's seen DB field is ignored by WebSocket server when checking for stale jobs to [tools][scheduling] Worker's seen DB field is ignored by WebSocket server when checking for stale jobs
  • Category changed from 122 to Feature requests

Is this still valid? Sorry, I don't understand it myself.

#7 Updated by mkittler 8 months ago

  • Difficulty set to medium

Current state: The "last seen" timestamp of a worker is updated in the database when the worker updates the job status. It is also updated when the worker sends its status updates via web sockets. And yes, in addition to that, we track the "last seen" timestamp a 2nd time in the web socket server. This 2nd timestamp is obviously not updated when the worker "just" uses the REST API. And only that 2nd timestamp is used to mark stale jobs as incomplete.
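To make the current state easier to follow, here is a minimal Python model of it. This is only a sketch and not the actual openQA code: the class `WorkerRecord`, its method names, and the 120-second `STALE_AFTER` threshold are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical threshold; the real value used by the web socket server may differ.
STALE_AFTER = timedelta(seconds=120)


@dataclass
class WorkerRecord:
    """Illustrative model of the two "last seen" timestamps (names are assumptions)."""

    db_seen: datetime  # "last seen" timestamp stored in the database
    ws_seen: datetime  # 2nd "last seen" timestamp tracked in the web socket server

    def on_job_status_update(self, now: datetime) -> None:
        # Job status update via the REST API: only the database timestamp moves.
        self.db_seen = now

    def on_ws_status_update(self, now: datetime) -> None:
        # Status update via web sockets: both timestamps move.
        self.db_seen = now
        self.ws_seen = now

    def is_stale(self, now: datetime) -> bool:
        # Only the web socket timestamp is consulted when looking for stale jobs.
        return now - self.ws_seen > STALE_AFTER


t0 = datetime(2018, 5, 5, 12, 0, 0)
worker = WorkerRecord(db_seen=t0, ws_seen=t0)

# The worker keeps using the REST API but sends nothing via web sockets.
worker.on_job_status_update(t0 + timedelta(seconds=100))
now = t0 + timedelta(seconds=150)
stale_despite_rest_activity = worker.is_stale(now)
```

In this model the worker is considered stale even though the database saw it 50 seconds ago, which is the discrepancy this ticket is about.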

Having the timestamp twice is a bit redundant and weird. However, the database timestamp is not updated during the multi-chunk upload, so taking it into account wouldn't help to prevent jobs from being marked as incomplete while the worker is blocking/unresponsive. Updating the database timestamp during the upload might be quite expensive. So although having 2 timestamps is not nice, I don't see any benefit in refactoring this right now.
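The reasoning above can be sketched in a few lines of Python. This is an illustration under assumptions, not openQA code: `is_stale`, `unified_last_seen`, and the 120-second `TIMEOUT` are hypothetical, and "unified" here simply means taking the later of the two timestamps.

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(seconds=120)  # hypothetical staleness threshold


def is_stale(last_seen: datetime, now: datetime, timeout: timedelta = TIMEOUT) -> bool:
    """Return True if the worker was last seen longer ago than the timeout."""
    return now - last_seen > timeout


def unified_last_seen(db_seen: datetime, ws_seen: datetime) -> datetime:
    """One possible unification (an assumption): use the later of both timestamps."""
    return max(db_seen, ws_seen)


upload_started = datetime(2018, 5, 5, 12, 0, 0)
now = upload_started + timedelta(minutes=5)

# During a blocking multi-chunk upload neither timestamp is updated,
# so both are frozen at the moment the upload started.
db_seen = ws_seen = upload_started

ws_only_stale = is_stale(ws_seen, now)
unified_stale = is_stale(unified_last_seen(db_seen, ws_seen), now)
```

Both checks report the worker as stale in this scenario, so unifying the timestamps alone would not stop the job from being marked as incomplete. The unified check would only differ in the case where the worker is still active on the REST API but silent on the web socket connection.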

Improving the multi-chunk upload and other blocking things on the worker is much more beneficial to prevent the problem in the first place.

Note that we sometimes see jobs in a perpetual "running" or "uploading" state. I'm afraid this refactoring wouldn't help here either, because in these cases the jobs are not marked as incomplete because the worker-job relation is (somehow) unset.

So while this "curiosity" in our code base still exists I don't see a big benefit in improving it.

#8 Updated by mkittler 6 months ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
  • Target version changed from Ready to Current Sprint

#9 Updated by mkittler 6 months ago

  • Status changed from In Progress to Resolved
  • Target version changed from Current Sprint to Done

PR has been merged
