action #157690

closed

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

coordination #157669: websockets+scheduler improvements to support more online worker instances

Simple global limit of registered/online workers size:M

Added by okurz 9 months ago. Updated 2 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2024-03-21
Due date:
% Done: 0%
Estimated time:

Description

Motivation

As observed in #157666, we seem to have a problem when too many openQA workers are connected at the same time. Similar to the global job limit in #129619, we should add a simple, configurable global limit on how many workers can be online on one openQA instance at the same time.

Acceptance criteria

  • AC1: A simple (KISS) configuration option for the maximum number of online workers exists
  • AC2: openQA workers rejected for exceeding this limit explicitly log the rejection or fail

Rollback actions

Suggestions

  • In the openQA web API, reject openQA worker registration or handling if a global, configurable limit is exceeded. It might also make sense to allow the registration but prevent the creation of the web socket connection.
  • Select a sensible default, e.g. 1k.
  • Explicitly log or fail the openQA worker if rejected. A worker could be registered and tracked as "offline" while it is rejected and not connected, ideally with an error message visible in the web UI. If that is too complicated, start with something simpler, e.g. a fatal failure of the worker instance.
  • Come up with an approach that allows a worker to attempt a connection again at some point, e.g. the web UI could send a waiting period with the rejection. The normal retry on failure would probably just cause more noise.
  • Consider using OpenQA::Utils::usleep_backoff, as is done in chunk uploading
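
The suggestions above could be sketched roughly as follows. This is only an illustration in Python, not openQA code: `WorkerRegistry`, `max_online_workers`, `retry_after_s` and `backoff_delay_s` are hypothetical names, the 60 s waiting period is an assumed value, and the backoff helper does not reproduce the behavior or signature of `OpenQA::Utils::usleep_backoff`.

```python
import random
from dataclasses import dataclass, field


@dataclass
class WorkerRegistry:
    """Hypothetical registry of online workers (not openQA's data model)."""

    max_online_workers: int = 1000  # default suggested in the ticket ("e.g. 1k")
    retry_after_s: int = 60         # assumed waiting period sent with a rejection
    online: set = field(default_factory=set)

    def register(self, worker_id):
        """Return (accepted, retry_after_s); retry_after_s is None if accepted."""
        if worker_id in self.online:
            # A worker that is already online does not count twice.
            return True, None
        if len(self.online) >= self.max_online_workers:
            # Limit reached: reject and tell the worker when to try again.
            return False, self.retry_after_s
        self.online.add(worker_id)
        return True, None

    def disconnect(self, worker_id):
        """Free the slot held by a worker that went offline."""
        self.online.discard(worker_id)


def backoff_delay_s(attempt, base_s=1.0, cap_s=300.0, rng=random.random):
    """Capped exponential backoff with full jitter, for worker-side retries."""
    return rng() * min(cap_s, base_s * 2 ** attempt)
```

A rejected worker would log the rejection, wait at least the period sent by the web UI, and only then retry, falling back to something like `backoff_delay_s` when no waiting period was sent. The jitter is there so that many rejected workers do not reconnect at the same moment.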

Related issues 3 (2 open, 1 closed)

Related to openQA Infrastructure (public) - action #167557: OSD not starting new jobs on 2024-09-28 due to >1k worker instances connected, overloading websocket server (Resolved, okurz, 2024-09-28)
Related to openQA Infrastructure (public) - action #166802: Recover worker37, worker38, worker39 size:S (Blocked, okurz)
Copied to openQA Project (public) - action #168178: Limit connected online workers based on websocket+scheduler load size:M (Workable)