action #168178: Limit connected online workers based on websocket+scheduler load size:M - openQA Project (public) - openSUSE Project Management Tool


action #168178

open

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

coordination #157669: websockets+scheduler improvements to support more online worker instances

Limit connected online workers based on websocket+scheduler load size:M

Added by okurz 7 months ago. Updated 19 days ago.

Status: Workable
Priority: Low
Assignee:
Category: Feature requests
Target version: Ready
Start date:
Due date:
% Done:
Estimated time:

Description

Motivation

With #157690, the number of connected online workers is already limited by a configuration variable. We can extend that to limit the number based on the actual websocket+scheduler load, that is, keep the number of workers low enough to ensure proper operation of websocket+scheduler and prevent problems like #157666.

Acceptance criteria

AC1: A clear definition of "websocket+scheduler load" exists
AC2: The number of online workers is limited to min(configured_number, configured_load_limit)
AC3: openQA workers rejected for exceeding the mentioned limit(s) explicitly log that situation or fail
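
One possible reading of AC2 and AC3, as a minimal Python sketch rather than openQA's actual (Perl) implementation. All names here (`Limits`, `effective_limit`, `admit_worker`) are hypothetical, and it assumes that `configured_load_limit` can be expressed as a worker count derived from whatever load definition AC1 produces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    # Static cap on online workers (the kind of limit introduced with #157690).
    configured_number: int
    # Assumption: a worker cap derived from the load definition of AC1.
    configured_load_limit: int


def effective_limit(limits: Limits) -> int:
    """AC2: the stricter of the two caps wins."""
    return min(limits.configured_number, limits.configured_load_limit)


def admit_worker(online_workers: int, limits: Limits) -> tuple[bool, str]:
    """Return (admitted, message); the message makes a rejection explicit (AC3)."""
    limit = effective_limit(limits)
    if online_workers >= limit:
        return False, f"rejected: {online_workers} workers online, limit is {limit}"
    return True, "accepted"
```

A rejected worker would log the returned message or exit with it, so the situation never passes silently.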

Suggestions

Look into the implementation of #157690 to see how the simple limit was implemented so far
Come up with a definition of the critical websocket+scheduler load, based on "overload experiments", which can be used as a metric for the problem seen in #157666
Extend the simple limit with a lookup of that metric and also prevent additional worker connections based on it
Also consider disconnecting already connected workers if the metric exceeds the configured threshold
Consider that the configured limit is now (as of https://github.com/os-autoinst/openQA/pull/6358) used to increase the Mojolicious limit of connections. This means the limit is no longer as low as it previously was, see #181784.
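
The suggestion to disconnect already connected workers could look roughly like the following Python sketch. It is an illustration under assumptions, not openQA code: the metric is treated as a single number compared against one threshold, the function names are hypothetical, and shedding the most recently connected workers in small batches is just one possible policy.

```python
def accepts_new_connections(load: float, threshold: float) -> bool:
    """New worker connections are refused while the metric exceeds the threshold."""
    return load <= threshold


def workers_to_disconnect(connected: list, load: float, threshold: float,
                          batch_size: int = 10) -> list:
    """Pick workers to disconnect when the metric exceeds the threshold.

    `connected` holds worker IDs ordered by connection time, oldest first.
    Assumption: the newest connections are shed first, at most `batch_size`
    per call, so the metric can be measured again before shedding more.
    """
    if load <= threshold or batch_size <= 0:
        return []
    return connected[-batch_size:]
```

Shedding in batches and re-checking the metric between calls avoids disconnecting more workers than needed, since the effect of each disconnect on the load is not known in advance.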

Rollback steps

~~DONE: Ensure sapworker2.qe.nue2.suse.org is powered down, as it is/was used to create many workers while working on this ticket.~~

Related issues 2 (1 open — 1 closed)


