action #168178


coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

coordination #157669: websockets+scheduler improvements to support more online worker instances

Limit connected online workers based on websocket+scheduler load size:M

Added by okurz 6 months ago. Updated about 7 hours ago.

Status:
In Progress
Priority:
Low
Assignee:
Category:
Feature requests
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:

Description

Motivation

With #157690 the number of connected online workers is already limited based on a configuration variable. We can extend that to limit based on the actual websocket+scheduler load, i.e. keep the number of workers low enough to ensure proper operation of the websocket server and scheduler and prevent problems like #157666.

Acceptance criteria

  • AC1: A clear definition of "websocket+scheduler load" exists
  • AC2: The number of online workers is limited to min(configured_number, configured_load_limit)
  • AC3: openQA workers rejected for exceeding the mentioned limit(s) explicitly log or fail in that situation

Suggestions

  • Look into the implementation of #157690 to see how the simple limit has been implemented so far
  • Come up with a definition of the critical websocket+scheduler load based on "overload experiments" which can be used as a metric for the problem seen in #157666
  • Extend the simple limit with a lookup of said metric and also prevent additional worker connections based on the metric
  • Also consider disconnecting already connected workers if the metric exceeds the configured threshold (see the sketch after this list)
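
A minimal sketch of what the suggested behaviour could look like, purely for illustration: it combines the static limit from #157690 with a hypothetical load-based limit. All names, thresholds, and the load formula below are assumptions made up for this sketch, not openQA's actual code or configuration keys.

```python
# Hypothetical sketch, not openQA's actual implementation: combine the static
# worker limit from #157690 with a load-based limit and decide whether a new
# worker registration may be accepted. All names and thresholds are assumed.

from dataclasses import dataclass


@dataclass
class LimitConfig:
    max_online_workers: int = 1000   # static limit as in #157690 (name assumed)
    load_limit: float = 0.8          # critical "websocket+scheduler load" (definition is AC1, still open)


def allowed_online_workers(cfg: LimitConfig, current_load: float, currently_online: int) -> int:
    """Effective limit per AC2: min(configured_number, configured_load_limit)."""
    if current_load >= cfg.load_limit:
        # Over the threshold: shrink the allowance proportionally so that
        # already connected workers above it could also be disconnected.
        load_based_limit = int(currently_online * cfg.load_limit / max(current_load, 1e-9))
    else:
        load_based_limit = cfg.max_online_workers
    return min(cfg.max_online_workers, load_based_limit)


def handle_registration(cfg: LimitConfig, current_load: float, currently_online: int) -> bool:
    """Accept or reject a new worker registration and log rejections (AC3)."""
    limit = allowed_online_workers(cfg, current_load, currently_online)
    if currently_online >= limit:
        print(f"rejecting worker: {currently_online} online >= limit {limit} "
              f"(load {current_load:.2f}, threshold {cfg.load_limit:.2f})")
        return False
    return True


if __name__ == "__main__":
    cfg = LimitConfig()
    print(handle_registration(cfg, current_load=0.9, currently_online=900))  # rejected
    print(handle_registration(cfg, current_load=0.3, currently_online=900))  # accepted
```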

Rollback steps

  • Ensure sapworker2.qe.nue2.suse.org is powered down, as it is/was used while working on this ticket to create many workers.

Related issues: 2 (1 open, 1 closed)

Related to openQA Infrastructure (public) - action #166802: Recover worker37, worker38, worker39 size:S (Blocked, assignee: okurz)

Copied from openQA Project (public) - action #157690: Simple global limit of registered/online workers size:M (Resolved, assignee: mkittler, 2024-03-21)

Actions #1

Updated by okurz 6 months ago

  • Copied from action #157690: Simple global limit of registered/online workers size:M added
Actions #2

Updated by okurz 6 months ago

  • Target version changed from Ready to Tools - Next
Actions #3

Updated by okurz 6 months ago

  • Target version changed from Tools - Next to Ready
Actions #4

Updated by okurz 6 months ago

  • Subject changed from Limit connected online workers based on websocket+scheduler load to Limit connected online workers based on websocket+scheduler load size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by okurz 5 months ago

  • Related to action #166802: Recover worker37, worker38, worker39 size:S added
Actions #6

Updated by okurz 4 months ago

  • Target version changed from Ready to Tools - Next
Actions #7

Updated by okurz about 2 months ago

  • Target version changed from Tools - Next to Ready
Actions #8

Updated by mkittler about 1 month ago

I don't remember exactly how we envisioned this to work.

Did we have figures supporting the claim that the load of the websocket server and scheduler is actually high in the problematic situation? I'm asking because I highly doubt that this is the case. We have already established that at least the websocket server does not cause much CPU load. The same is probably true for the scheduler. I also doubt that either causes a considerable amount of I/O load. Maybe the worker processes of PostgreSQL cause a high I/O load (or even CPU load) instead (which would make it hard to pin down this load to the scheduling problem). Maybe none of the processes cause a high load at all because locks held by some transaction are the bottleneck.

I would probably approach this by trying to solve the actual problem first, which means provoking it in some way locally with the help of some servers with enough RAM¹. Then I'd closely observe the resource usage locally and what causes it exactly. Depending on the findings, fixing the problem itself might be simpler than adding a dynamic limit for the case when the problem occurs². So adding a dynamic limit might not make sense, and adding it blindly without knowing what to look for makes even less sense.
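
The kind of local observation described above could start from something as simple as the following sketch. It assumes the scheduler, websocket server, and PostgreSQL run as processes whose command lines contain the names listed in WATCHED; those names are assumptions for illustration, not verified against an actual deployment. Requires the psutil package.

```python
# Rough per-process resource sampler for "overload experiments": prints CPU and
# cumulative I/O counters for the watched processes in a loop. Process name
# substrings below are assumptions made for this sketch.

import time
import psutil

WATCHED = ("openqa-scheduler", "openqa-websockets", "postgres")


def watched_processes():
    """Yield processes whose command line mentions one of the watched names."""
    for proc in psutil.process_iter(["pid", "name", "cmdline"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if any(name in cmdline for name in WATCHED):
            yield proc


def sample(interval=5.0):
    """Print CPU and I/O usage per watched process every `interval` seconds."""
    while True:
        for proc in watched_processes():
            try:
                cpu = proc.cpu_percent(interval=None)
                io = proc.io_counters()
                print(f"{proc.pid:>7} {proc.info['name']:<20} "
                      f"cpu={cpu:5.1f}% read={io.read_bytes} write={io.write_bytes}")
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue
        print("---")
        time.sleep(interval)


if __name__ == "__main__":
    sample()
```

Whether CPU, I/O, or lock contention dominates would then point to which metric (if any) is worth using as the "websocket+scheduler load" from AC1.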

So I guess by trying to fight the symptoms first we're approaching the problem from the wrong angle.


¹ When working on this before I noticed that with "only" 32 GiB RAM the number of worker slots I could start on my laptop was quite limited.
² Adding a dynamic limit isn't that trivial.

Actions #9

Updated by okurz about 1 month ago

  • Target version changed from Ready to Tools - Next
Actions #10

Updated by okurz about 1 month ago

mkittler wrote in #note-8:

I would probably approach this by trying to solve the actual problem first - which means provoking it in some way locally with the help of some servers with enough RAM¹. Then I'd closely observe the resource usage locally and what causes this exactly.

I agree. That is how I interpret the suggestion "Come up with a definition of the critical websocket+scheduler load based on 'overload experiments'".

Actions #12

Updated by okurz 20 days ago

  • Target version changed from Tools - Next to Ready
Actions #13

Updated by mkittler 6 days ago

  • Assignee set to mkittler

Solving the problem by removing a bottleneck or problematic behavior in general is not going to bring us to any kind of definition of load. If it is possible to solve this without adding a limit, I would also prefer to avoid adding a limit (because the additional monitoring needed to enforce a limit would add complexity).

I can still move this ticket forward to some extent even though the ACs might not be useful in the end.

Actions #14

Updated by mkittler about 7 hours ago

  • Description updated (diff)
  • Status changed from Workable to In Progress
