Action #51734 (closed)

[scheduling] Making shared workers round robin

Added by coolo almost 5 years ago. Updated over 4 years ago.

Status: Rejected
Priority: Normal
Assignee: -
Category: Feature requests
Target version: -
Start date: 2019-05-21
Due date: -
% Done: 0%
Estimated time: -
Description

Instead of accepting jobs from multiple webuis, the worker should only be connected to one at a time - and stay there for, let's say, a minute. If there are no jobs within that minute, it switches webuis.

This will make tracking of jobs more reliable, as the worker:webui mapping becomes binary instead of fuzzy.
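
A minimal sketch of this behavior, assuming a Mojo::IOLoop-based worker (the URLs, the interval and all function names are illustrative, not actual openQA code):

    #!/usr/bin/env perl
    # Sketch of the proposed round-robin behaviour, not openQA worker code:
    # dedicate the worker to one web UI for a fixed interval, then move on
    # if no job arrived.
    use Mojo::Base -strict;
    use Mojo::IOLoop;

    my @webuis   = ('http://webui-a.example', 'http://webui-b.example');
    my $interval = 60;    # seconds to stay dedicated to one web UI
    my $current  = 0;

    sub register_with_current_webui {
        my $url = $webuis[$current];
        say "registering with $url, accepting jobs only from there";
        # a real worker would (re-)register here and, once a job arrives,
        # stop switching until that job is done
        $current = ($current + 1) % @webuis;
    }

    register_with_current_webui();
    Mojo::IOLoop->recurring($interval => \&register_with_current_webui);
    Mojo::IOLoop->start;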

Actions #1

Updated by mkittler almost 5 years ago

I looked into this a little bit.

To get started, one obviously needs to run multiple openQA instances in parallel. Luckily it is easy to adjust the ports by setting MOJO_LISTEN. For the additional databases I just created copies of my existing one. If anybody on the team using my openQA-helpers wants to accomplish the same setup (e.g. to help me with this ticket): I extended openqa-start and added brief documentation: https://github.com/Martchus/openQA-helper#starting-the-web-ui-and-all-required-daemons
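
For reference, a sketch of such a setup with a development checkout (the port is an example value, and each instance additionally needs its own database configured, e.g. the copy mentioned above):

    # first web UI on the default port
    script/openqa daemon
    # second web UI on another port, configured to use the copied database
    MOJO_LISTEN=http://127.0.0.1:9527 script/openqa daemon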

I also had a look at the code, and making it round robin will for sure require a lot of refactoring.

Actions #2

Updated by mkittler almost 5 years ago

Note that this change is unlikely to reduce the amount of worker code. I suppose one needs to introduce at least one extra timer, a function to unregister from a certain web UI, and a configuration option for the maximum time the worker stays dedicated to a web UI with no jobs. Maybe doing some cleanup in the relevant areas first would make more sense.
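
To illustrate, such a setting could live in workers.ini next to the existing HOST list; MAX_WEBUI_IDLE_TIME is a made-up name for the proposed option, not an existing openQA setting:

    [global]
    # web UIs the worker registers with (HOST already accepts a list)
    HOST = http://openqa1.example http://openqa2.example
    # hypothetical new option: max. seconds to stay dedicated to one
    # web UI while it has no jobs for this worker
    MAX_WEBUI_IDLE_TIME = 60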

Actions #3

Updated by mkittler almost 5 years ago

  • Assignee set to mkittler
Actions #4

Updated by coolo almost 5 years ago

Well, simplifying the worker code is not the target of this. The goal is to simplify the worker->webui relationship. I would also like to remove the dead code detection and thereby remove the worker's responsibility to get back to the webui. And this will simplify the worker code, as the main task can be done without so much hassle with locking.

Actions #5

Updated by mkittler almost 5 years ago

  • Status changed from New to In Progress
  • Target version changed from Ready to Current Sprint
Actions #6

Updated by okurz almost 5 years ago

  • Subject changed from Making shared workers round robin to [scheduling] Making shared workers round robin
  • Category changed from 122 to Feature requests
Actions #7

Updated by mkittler almost 5 years ago

  • Status changed from In Progress to Feedback

After the discussion we had in the chat some time ago, this ticket is outdated. So here I'm summarizing/repeating the concerns I previously mentioned in the chat:

  • In production we don't use shared workers, so this feature would make no difference there (and bring no benefit).
  • On other instances run by the team, shared workers are actually used, but this change would increase the time for starting jobs on shared workers significantly. Even the stupid behavior we currently have (a job might be assigned for a short while to a worker occupied by another web UI) seems faster. At least when doing tests for the restructured worker I noticed that a bad assignment is corrected in under a minute, which is faster than the average waiting time due to round-robin allocation would be.

Then @coolo responded:

I reread the discussion about the worker round robin. I would be fine if the workers had some state where they monitored multiple webuis for work to do - but then we need to extend it so they switch to 'taking jobs' explicitly - and do this only for one. Kind of back to the old 'grab job' days then.

Then the question came up why we don't just have a 'working for another web UI' state. The answer from @coolo:

because of clusters - if we have to schedule 16 jobs for a cluster, having one of them fail because the worker was also picking jobs from another webui is very expensive. So we want to avoid this situation.
And right now we only take the job as assigned if the worker confirmed - which creates a rather bizarre ping-pong in the protocol, which leads to situations where we just don't know anymore what happened.
So I would like to have this simplified: the worker decides where to register and the scheduler decides what the worker works on. If the job isn't confirmed within X minutes, it's an error of the job - and we won't reschedule it.
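
A minimal sketch of that simplified assignment rule (the timeout value, data shapes and names are my assumptions, not actual scheduler code):

    use Mojo::Base -strict;
    use Mojo::IOLoop;

    my $confirmation_timeout = 2 * 60;    # "X minutes", value is made up

    sub assign_job {
        my ($job) = @_;
        $job->{state} = 'assigned';    # assigned immediately, no ping-pong
        Mojo::IOLoop->timer($confirmation_timeout => sub {
            # an unconfirmed assignment counts as an error of the job
            # and is deliberately not rescheduled
            $job->{state} = 'failed' unless $job->{confirmed};
        });
    }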


So I see 2 use cases which would be affected by this change:

  • scheduling clusters in a shared worker setup
    • This is the use case round robin is supposed to improve.
    • Why not simply avoid shared workers when scheduling clusters?
    • @coolo suggested improving the handling of this use case by introducing a 'taking jobs' worker state which only one web UI sees at a time. If I understand this idea correctly, there is no difference to the initial round-robin ticket other than that the web socket connections are kept open to all web UIs. So all web UIs at least see that the worker is online.
  • scheduling independent jobs in a shared worker setup
    • Round robin is not useful for this use case; it increases the average time it takes to run jobs (see the back-of-the-envelope model after this list).
    • So one should be able to disable the round-robin behavior in this case.
    • This use case would be improved by introducing a 'working for another web UI' worker state to prevent assignments to busy workers.
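
To make the waiting-time concern concrete, here is a back-of-the-envelope model (my own assumption, not from the ticket): if a worker cycles through n web UIs, staying t seconds on each, a job arriving at a random time on one particular web UI waits on average (n-1)^2 * t / (2n) seconds, and up to (n-1) * t seconds in the worst case:

    use Mojo::Base -strict;

    my $t = 60;    # seconds the worker stays dedicated to each web UI
    for my $n (2 .. 5) {
        printf "%d web UIs: average wait %5.1f s, worst case %3d s\n",
            $n, ($n - 1)**2 * $t / (2 * $n), ($n - 1) * $t;
    }

Under this model, with four or more shared web UIs the average wait already exceeds the sub-minute correction time observed for the current behavior.
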
Actions #8

Updated by mkittler almost 5 years ago

  • Assignee deleted (mkittler)

There was also another interesting thought from @coolo:

if using shared workers is a problem we have to ask ourselves why developers share workers at all - and not webuis. And work on that.

I'm unassigning for now.

Actions #9

Updated by okurz almost 5 years ago

  • Status changed from Feedback to New
  • Target version changed from Current Sprint to Ready

if using shared workers is a problem we have to ask ourselves why developers share workers at all - and not webuis. And work on that.

Why developers share workers is obvious: because of resource limits. Not everybody has (or should have) access to a dedicated s390x z/VM instance, for example. Why not share the web UIs? Well, because we tried to reduce dependence on other developers, who might have different requirements regarding the versions of os-autoinst and os-autoinst-distri-opensuse. This isn't perfect either, of course, but it helps a bit.

Increasing the average wait time for a job to start isn't exactly helpful.

IMHO this ticket as of now is a good example of prescribing an implementation to developers without making clear what the actual user story is, and it seems this is still what we are confused by. I suggest not even counting it as "Ready".

Actions #10

Updated by coolo almost 5 years ago

  • Target version changed from Ready to Current Sprint

Well, there is no user story behind it. And I would appreciate it if you would not play PO.

The feature is clearly part of the refactoring of the worker/webui relationship.

Actions #11

Updated by okurz almost 5 years ago

Sure, but how do you see it as workable then? mkittler is now the "worker+scheduling expert", and he unassigned himself and stated that it probably does not make sense. I would not know how to work on this then, and I doubt others would.

Actions #12

Updated by mkittler almost 5 years ago

  • Parent task set to #41066

So the actual use case this ticket has in mind is the one from the ticket which I have now added as the parent. Nevertheless, the other use cases I have found are impacted by this change.

The 3rd use case would be: run a job directly after another job on the same worker (for bare-metal testing).

This use case is not directly improved by implementing round-robin scheduling, but the idea is that round-robin scheduling simplifies the architecture, removing problems that block us from implementing that use case.

Actions #13

Updated by coolo over 4 years ago

  • Status changed from New to Rejected
  • Target version deleted (Current Sprint)

We went with a different solution not requiring this.

Actions #14

Updated by mkittler over 4 years ago

  • Parent task deleted (#41066)

I cannot set the parent task to resolved, which is likely because this ticket is rejected.
