action #27454

closed

coordination #32851: [tools][EPIC] Scheduling redesign

[tools][scheduling] Worker's seen DB field is ignored by WebSocket server when checking for stale jobs

Added by EDiGiacinto almost 7 years ago. Updated almost 5 years ago.

Status:
Resolved
Priority:
Low
Assignee:
Category:
Feature requests
Target version:
Start date:
2018-05-05
Due date:
% Done:

0%

Estimated time:

Description

A worker's status is also updated via other routes (e.g. while updating the job status [1]), but in the WebSocket server we check for stale jobs using another field that is updated in the WebSocket server context [2] and is later used to reap jobs that belong to inactive workers [3].
We should unify the way we check the worker's seen status, possibly using the DB as the reference; otherwise jobs could be marked as incomplete if a blocking operation occurs on the worker side (e.g. during the cache setup phase, rsync calls, etc.).

  1. https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Schema/Result/Jobs.pm#L1357
  2. https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/WebSockets/Server.pm#L204
  3. https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/WebSockets/Server.pm#L348

Related issues 1 (0 open, 1 closed)

Related to openQA Project - action #25970: Profile/Optimize _workers_checker in WebSockets server - Resolved - 2017-10-11

Actions #1

Updated by EDiGiacinto almost 7 years ago

  • Related to action #25970: Profile/Optimize _workers_checker in WebSockets server added
Actions #3

Updated by coolo almost 7 years ago

  • Target version set to Ready

We stopped updating this field because updating it several times per second was causing a lot of DB noise.

Actions #4

Updated by EDiGiacinto over 6 years ago

  • Category set to 122
  • Parent task set to #32851

This is still related to scheduling (as some logic is split in the ws server)

Actions #5

Updated by szarate over 6 years ago

  • Start date changed from 2017-11-07 to 2018-05-05

due to changes in a related task

Actions #6

Updated by okurz about 5 years ago

  • Subject changed from [tools] Worker's seen DB field is ignored by WebSocket server when checking for stale jobs to [tools][scheduling] Worker's seen DB field is ignored by WebSocket server when checking for stale jobs
  • Category changed from 122 to Feature requests

Is this still valid? Sorry, I don't understand it myself.

Actions #7

Updated by mkittler about 5 years ago

  • Difficulty set to medium

Current state: The "last seen" timestamp of a worker is updated in the database when the worker updates the job status. It is also updated when the worker sends its status updates via WebSockets. In addition, we track the "last seen" timestamp a second time in the WebSocket server. This second timestamp is obviously not updated when the worker "just" uses the REST API, and only that timestamp is used to mark stale jobs as incomplete.
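
To make the current state easier to follow, here is a rough toy model of the two timestamps in Python. It is not the actual openQA code (which is Perl); the class, method names, and the `timeout` parameter are all hypothetical and only mirror the behaviour described above: the REST route refreshes the database value only, the WebSocket route refreshes both, and the stale check reads only the in-memory value.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class WorkerTracker:
    """Hypothetical model of the two "last seen" timestamps per worker."""

    # Stands in for the "seen" column in the database (assumption).
    db_seen: Dict[int, float] = field(default_factory=dict)
    # Stands in for the copy kept in WebSocket server memory (assumption).
    ws_seen: Dict[int, float] = field(default_factory=dict)

    def job_status_update_via_rest(self, worker_id: int, now: float) -> None:
        # REST route: only the database timestamp is refreshed.
        self.db_seen[worker_id] = now

    def status_update_via_websocket(self, worker_id: int, now: float) -> None:
        # WebSocket route: both timestamps are refreshed.
        self.db_seen[worker_id] = now
        self.ws_seen[worker_id] = now

    def stale_workers(self, now: float, timeout: float) -> List[int]:
        # Only the in-memory timestamp is consulted, so REST activity
        # does not keep a worker from being considered stale.
        return sorted(
            worker_id
            for worker_id, seen in self.ws_seen.items()
            if now - seen > timeout
        )


tracker = WorkerTracker()
tracker.status_update_via_websocket(1, now=0.0)
tracker.job_status_update_via_rest(1, now=50.0)
stale = tracker.stale_workers(now=60.0, timeout=30.0)
```

In this sketch worker 1 ends up in `stale` even though its database timestamp was refreshed ten seconds earlier, which is the mismatch this ticket is about.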

Having the timestamp twice is a bit redundant and weird. However, since the database timestamp is not updated during the multi-chunk upload, taking it into account wouldn't help prevent jobs from being marked as incomplete because the worker is blocking/unresponsive. Updating the database timestamp during the upload might be quite expensive. So although having two timestamps is not nice, I don't see any benefit in refactoring this right now.

Improving the multi-chunk upload and other blocking things on the worker is much more beneficial to prevent the problem in the first place.

Note that we sometimes see jobs stuck in a perpetual "running" or "uploading" state. I'm afraid this refactoring wouldn't help there either, because in these cases the jobs are not marked as incomplete since the worker-job relation is (somehow) unset.

So while this "curiosity" in our code base still exists I don't see a big benefit in improving it.

Actions #8

Updated by mkittler almost 5 years ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
  • Target version changed from Ready to Current Sprint
Actions #9

Updated by mkittler almost 5 years ago

  • Status changed from In Progress to Resolved
  • Target version changed from Current Sprint to Done

PR has been merged
