action #157690: Simple global limit of registered/online workers size:M - openQA Project (public) - openSUSE Project Management Tool


action #157690

closed

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

coordination #157669: websockets+scheduler improvements to support more online worker instances

Simple global limit of registered/online workers size:M

Added by okurz about 1 year ago. Updated 5 months ago.

Status:

Resolved

Priority:

Normal

Assignee:

mkittler

Category:

Feature requests

Target version:

Ready

Start date:

2024-03-21

Due date:

% Done:

Estimated time:

Description

Motivation¶

As observed in #157666, we seem to have a problem when too many openQA workers are connected at the same time. Similar to the global job limit in #129619, we should add a simple, configurable global limit on how many workers can be online at the same time on one openQA instance.

Acceptance criteria¶

AC1: A simple (KISS) configuration option for the maximum number of online workers exists
AC2: openQA workers that are rejected because the limit is exceeded explicitly log the rejection or fail

Rollback actions¶

R1: Remove silence alertname=Broken workers alert https://stats.openqa-monitor.qa.suse.de/alerting/silences


Suggestions¶

In the openQA web API, reject openQA worker registration or handling if a global, configurable limit is exceeded. It may also make sense to allow the registration but prevent the creation of the web socket connection.
Select a sensible default, e.g. 1k.
Explicitly log the rejection or fail the openQA worker if it is rejected. A worker could be registered and tracked as "offline" while it is rejected and not connected, ideally with an error message visible in the web UI. If that is too complicated, start with something simpler, e.g. a fatal failure of the worker instance.
Come up with an approach that allows a worker to attempt a connection again at some point, e.g. the web UI could send a waiting period with the rejection. The normal retry on failure would probably just cause more noise.
Consider using OpenQA::Utils::usleep_backoff, the same as is done in the chunk uploading.
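
The suggestions above can be sketched roughly as follows. This is a minimal illustration in Python, not openQA's actual implementation (openQA is written in Perl): the class `WorkerRegistry`, the function `next_delay`, the 30 second waiting period and the 600 second cap are all hypothetical names and values chosen for the example. Only the default limit of 1000 comes from the ticket. The backoff is a generic capped exponential backoff with jitter and is not a port of OpenQA::Utils::usleep_backoff.

```python
import random
from dataclasses import dataclass, field


@dataclass
class WorkerRegistry:
    """Hypothetical stand-in for the server side; not openQA's real code."""

    max_online_workers: int = 1000  # default suggested in the ticket ("e.g. 1k")
    retry_delay_s: float = 30.0     # assumed waiting period sent with a rejection
    online: set = field(default_factory=set)

    def connect(self, worker_id):
        """Return (accepted, retry_after_s) for a connection attempt."""
        if worker_id in self.online:
            return True, 0.0  # already counted, nothing to do
        if len(self.online) >= self.max_online_workers:
            # Reject explicitly and tell the worker how long to wait.
            return False, self.retry_delay_s
        self.online.add(worker_id)
        return True, 0.0

    def disconnect(self, worker_id):
        """Free the slot held by a worker, if any."""
        self.online.discard(worker_id)


def next_delay(attempt, server_hint_s, cap_s=600.0, jitter=random.random):
    """Worker-side wait before retry number `attempt` (0-based).

    Starts from the waiting period sent by the server, doubles per attempt,
    is capped at `cap_s`, and is scaled by a jitter factor in [0.5, 1.0]
    so that rejected workers do not all retry at the same moment.
    """
    base = min(cap_s, server_hint_s * (2 ** attempt))
    return base * (0.5 + 0.5 * jitter())
```

A rejected worker would log the rejection, sleep for `next_delay(attempt, retry_after_s)` and then try again, which keeps the retry traffic lower than an immediate retry loop would.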

Related issues 3 (2 open — 1 closed)


Updated by okurz 6 months ago

Related to action #167557: OSD not starting new jobs on 2024-09-28 due to >1k worker instances connected, overloading websocket server added


Updated by okurz 6 months ago

Target version changed from future to Ready

#167557 happened today, hence adding this to the backlog.


Updated by okurz 6 months ago

Description updated (diff)


Updated by mkittler 6 months ago

Description updated (diff)


Updated by okurz 6 months ago

Subject changed from Simple global limit of registered/online workers to Simple global limit of registered/online workers size:M
Description updated (diff)
Status changed from New to Workable


Updated by mkittler 6 months ago

Status changed from Workable to In Progress
Assignee set to mkittler


Updated by mkittler 6 months ago

Status changed from In Progress to Feedback

PR: https://github.com/os-autoinst/openQA/pull/5988


Updated by okurz 6 months ago · Edited

Description updated (diff)

The PR is merged and the change is already deployed to OSD; with the default value it should already be effective. Feel free to try it out, or just resolve, trusting that the change is enough :)


Updated by mkittler 6 months ago

Status changed from Feedback to Resolved


#10

Updated by okurz 6 months ago

Copied to action #168178: Limit connected online workers based on websocket+scheduler load size:M added


#11

Updated by okurz 6 months ago · Edited

Status changed from Resolved to Feedback

I brought up w38+w39 as part of #166802 and now https://openqa.suse.de/admin/workers shows 1038 online workers. Shouldn't that be prevented by the default limit of 1k? I think what could be happening is that, while the scheduler tries to enforce the limit, connection attempts are still coming in, and AFAIK we don't disconnect workers that "made it in". So I now set a number of 900 in OSD locally in /etc/openqa/openqa.ini to see if that has an effect.


#12

Updated by okurz 6 months ago

Related to action #166802: Recover worker37, worker38, worker39 size:S added


#13

Updated by tinita 6 months ago

okurz wrote in #note-11:

So I now set a number of 900 in OSD locally in /etc/openqa/openqa.ini to see if that has an effect.

> grep max_online /etc/openqa/openqa.ini
max_online_workers = 9000
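
A mismatch like the one shown by the grep output (9000 instead of the intended 900) can also be caught with a small script. The following is only a sketch in Python using the standard `configparser` module; the helper names `find_option` and `check_limit` and the section name `example_section` are made up for the illustration. The script searches every section for the key, so it does not assume where the option lives in /etc/openqa/openqa.ini.

```python
import configparser


def find_option(ini_text, key):
    """Return (section, raw_value) for every section that defines `key`."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return [(s, parser[s][key]) for s in parser.sections() if key in parser[s]]


def check_limit(ini_text, expected, key="max_online_workers"):
    """Return a list of problems; an empty list means the value matches."""
    hits = find_option(ini_text, key)
    if not hits:
        return [f"{key} is not set; the built-in default applies"]
    problems = []
    for section, raw in hits:
        try:
            value = int(raw)
        except ValueError:
            problems.append(f"[{section}] {key} = {raw!r} is not an integer")
            continue
        if value != expected:
            problems.append(f"[{section}] {key} = {value}, expected {expected}")
    return problems


# Sample input; the section name is hypothetical, the value is the one
# from the grep output above.
SAMPLE = """
[example_section]
max_online_workers = 9000
"""

for problem in check_limit(SAMPLE, expected=900):
    print(problem)
```

On a real host the text would be read from the configuration file instead of `SAMPLE`, e.g. with `open("/etc/openqa/openqa.ini").read()`.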


#14

Updated by mkittler 6 months ago

It works now after actually putting 900 into the config file and restarting the websocket server. The INI file states a wrong default for the re-connect delay, but @okurz is creating a PR to fix that.


#15

Updated by okurz 6 months ago

Description updated (diff)
Status changed from Feedback to In Progress


#16

Updated by openqa_review 6 months ago

Due date set to 2024-10-29

Setting due date based on mean cycle time of SUSE QE Tools


#17

Updated by mkittler 6 months ago · Edited

Even after extending the ws server and scalability tests, I could not reproduce the issue that the limit is sometimes not effective. See https://github.com/os-autoinst/openQA/pull/6009 for my changes.

I actually tried harder than this PR and let the ws restart with a lowered limit, just like what we did in production. You can see the change here (the last commit is the interesting one).
When I run this locally (e.g. SCALABILITY_TEST_WITH_OFFLINE_WEBUI_HOST=0 SCALABILITY_TEST_JOB_COUNT=10 SCALABILITY_TEST_WORKER_COUNT=45 SCALABILITY_TEST_WORKER_LIMIT=50 SCALABILITY_TEST_WORKER_LIMIT_2=25 prove -l -v t/43-scheduling-and-worker-scalability.t) it always works. It probably makes no sense to commit this change as-is (due to the sleep, and I'm not sure whether we want to enable the restart subtest in the CI).


#18

Updated by mkittler 6 months ago

Status changed from In Progress to Feedback

I'm only waiting for an additional review on the PR because otherwise I don't know what else to improve. The online limit does not always work, but at least it does most of the time. We should probably focus on fixing the actual performance problems now.


#19

Updated by mkittler 6 months ago

Status changed from Feedback to Resolved

The PR has been merged.


#20

Updated by okurz 5 months ago

Due date deleted (~~2024-10-29~~)
