action #168178

open

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

coordination #157669: websockets+scheduler improvements to support more online worker instances

Limit connected online workers based on websocket+scheduler load size:M

Added by okurz 20 days ago. Updated 15 days ago.

Status: Workable
Priority: Low
Assignee: -
Category: Feature requests
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Description

Motivation

With #157690 the number of connected online workers is already limited by a configuration variable. We can extend that to also limit based on the actual websocket+scheduler load, i.e. keep the number of workers low enough that websocket+scheduler keep operating properly and problems like #157666 are prevented.

Acceptance criteria

  • AC1: A clear definition of "websocket+scheduler load" exists
  • AC2: The number of online workers is limited to min(configured_number,configured_load_limit)
  • AC3: openQA workers rejected for exceeding the mentioned limit(s) explicitly log that situation or fail
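
To illustrate one possible reading of AC2 and AC3, here is a minimal sketch in Python (not openQA's actual Perl code). It assumes the load limit can be expressed as a maximum worker count derived from the load; the names `effective_worker_limit`, `try_register` and `load_derived_limit` are hypothetical and not taken from the openQA code base.

```python
import logging

logger = logging.getLogger("worker_limit_sketch")  # hypothetical logger name


def effective_worker_limit(configured_number: int, load_derived_limit: int) -> int:
    """AC2: the effective limit is the smaller of the two limits."""
    return min(configured_number, load_derived_limit)


def try_register(online_workers: int, configured_number: int,
                 load_derived_limit: int) -> bool:
    """Accept a new worker only while the effective limit is not reached.

    AC3: a rejection is logged explicitly instead of being silent.
    """
    limit = effective_worker_limit(configured_number, load_derived_limit)
    if online_workers >= limit:
        logger.warning(
            "Rejecting worker: %d online, effective limit %d "
            "(configured %d, load-derived %d)",
            online_workers, limit, configured_number, load_derived_limit,
        )
        return False
    return True
```

With a configured limit of 1000 and a load-derived limit of 600, the sketch accepts the 600th worker and rejects the 601st with a warning.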

Suggestions

  • Look into the implementation of #157690 to see how the simple limit was implemented so far
  • Come up with a definition of the critical websocket+scheduler load based on "overload experiments" which can be used as a metric for the problem seen in #157666
  • Extend the simple limit with a lookup of the said metric and also prevent additional worker connections based on the metric
  • Also consider disconnecting already connected workers if the metric exceeds the configured threshold
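
The last two suggestions could be sketched roughly as follows, again in Python for illustration only. The load metric is still undefined (see AC1), so the sketch uses a purely hypothetical stand-in, a "load" value in milliseconds, and assumes that load grows roughly linearly with the number of connected workers. `LoadGate`, `threshold_ms` and `max_workers` are made-up names.

```python
from dataclasses import dataclass


@dataclass
class LoadGate:
    max_workers: int   # simple limit in the spirit of #157690
    threshold_ms: int  # hypothetical critical load, e.g. from overload experiments

    def accepts_new_worker(self, online: int, load_ms: int) -> bool:
        """Refuse new connections at the simple limit or at critical load."""
        return online < self.max_workers and load_ms < self.threshold_ms

    def workers_to_disconnect(self, online: int, load_ms: int) -> int:
        """Number of already connected workers to drop.

        Assumes load scales linearly with the worker count, so the target
        count shrinks by the factor threshold_ms / load_ms when overloaded.
        """
        target = self.max_workers
        if load_ms > self.threshold_ms:
            target = min(target, online * self.threshold_ms // load_ms)
        return max(0, online - target)
```

For example, with `LoadGate(max_workers=1000, threshold_ms=200)`, 800 online workers and a measured load of 400 ms, the sketch would suggest disconnecting 400 workers. Whether disconnecting is desirable at all, and how quickly to react, is left open by the ticket.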

Related issues 2 (1 open, 1 closed)

Related to openQA Infrastructure - action #166802: Recover worker37, worker38, worker39 size:S (Blocked, okurz)

Copied from openQA Project - action #157690: Simple global limit of registered/online workers size:M (Resolved, mkittler, 2024-03-21)

Actions #1

Updated by okurz 20 days ago

  • Copied from action #157690: Simple global limit of registered/online workers size:M added
Actions #2

Updated by okurz 19 days ago

  • Target version changed from Ready to Tools - Next
Actions #3

Updated by okurz 15 days ago

  • Target version changed from Tools - Next to Ready
Actions #4

Updated by okurz 15 days ago

  • Subject changed from Limit connected online workers based on websocket+scheduler load to Limit connected online workers based on websocket+scheduler load size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by okurz 10 days ago

  • Related to action #166802: Recover worker37, worker38, worker39 size:S added