action #168178
coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
coordination #157669: websockets+scheduler improvements to support more online worker instances
Limit connected online workers based on websocket+scheduler load size:M
Description
Motivation
With #157690 the number of connected online workers is already limited based on a configuration variable. We can extend that to also limit based on the actual websocket+scheduler load, i.e. keep the number of workers low enough to ensure proper operation of the websocket server and scheduler and to prevent problems like #157666.
Acceptance criteria
- AC1: A clear definition of "websocket+scheduler load" exists
- AC2: The number of online workers is limited to min(configured_number, configured_load_limit) (see the sketch right after this list)
- AC3: openQA workers rejected for exceeding the mentioned limit(s) explicitly log or fail in that situation
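A minimal sketch of how AC2 and AC3 could fit together, in Python purely for illustration (openQA itself is written in Perl); configured_load_limit and the helper names are assumptions, not existing openQA code:

```python
import logging

logger = logging.getLogger("worker-registration")


def effective_limit(configured_number: int, configured_load_limit: int) -> int:
    """AC2: the effective cap is the smaller of the static and the load-derived limit."""
    return min(configured_number, configured_load_limit)


def may_register(online_workers: int, configured_number: int, configured_load_limit: int) -> bool:
    """Decide whether one more worker may register; log the rejection otherwise (AC3)."""
    limit = effective_limit(configured_number, configured_load_limit)
    if online_workers >= limit:
        logger.warning("rejecting worker registration: %d workers online, limit is %d",
                       online_workers, limit)
        return False
    return True
```

The point is only that the load-derived limit acts as an additional cap on top of the already existing configured one.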
Suggestions
- Look into the implementation of #157690 to see how the simple limit was implemented so far
- Come up with a definition of the critical websocket+scheduler load based on "overload experiments" which can be used as a metric for the problem seen in #157666
- Extend the simple limit with a lookup of said metric and also prevent additional worker connections based on it (a rough sketch follows after this list)
- Also consider disconnecting already connected workers if the metric exceeds the configured threshold
- Consider that the configured limit is now (as of https://github.com/os-autoinst/openQA/pull/6358) used to increase the Mojolicious limit of connections. This means the limit is not as low anymore as it previously was, see #181784.
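To illustrate the last two suggestions, here is a rough Python sketch under the assumption that "websocket+scheduler load" can be expressed as a single number, here a hypothetical scheduler tick duration; none of these names exist in openQA:

```python
from dataclasses import dataclass


@dataclass
class LoadPolicy:
    """Hypothetical policy: shrink the allowed worker count as the scheduler slows down."""

    baseline_limit: int           # limit while the scheduler is considered healthy
    critical_tick_seconds: float  # tick duration regarded as "overloaded"

    def load_limit(self, scheduler_tick_seconds: float) -> int:
        # Below the critical tick duration the configured baseline applies unchanged.
        if scheduler_tick_seconds <= self.critical_tick_seconds:
            return self.baseline_limit
        # Above it, scale the limit down proportionally to the overload factor.
        factor = self.critical_tick_seconds / scheduler_tick_seconds
        return max(1, int(self.baseline_limit * factor))

    def workers_to_disconnect(self, online_workers: int, scheduler_tick_seconds: float) -> int:
        # Number of already connected workers that would have to go to get back under the limit.
        return max(0, online_workers - self.load_limit(scheduler_tick_seconds))


# Example: with a 60 s critical tick, a 120 s tick halves the allowed worker count.
policy = LoadPolicy(baseline_limit=1500, critical_tick_seconds=60.0)
print(policy.load_limit(120.0))                   # 750
print(policy.workers_to_disconnect(1000, 120.0))  # 250
```

Whatever metric the overload experiments end up suggesting could be plugged in in place of the tick duration.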
Rollback steps
DONE: Ensure sapworker2.qe.nue2.suse.org is powered down as it is/was used to create many workers while working on this ticket.
Updated by okurz 7 months ago
- Copied from action #157690: Simple global limit of registered/online workers size:M added
Updated by okurz 7 months ago
- Related to action #166802: Recover worker37, worker38, worker39 size:S added
Updated by mkittler 3 months ago
I don't remember exactly how we envisioned this to work.
Did we have figures supporting that the load of the websocket server and the scheduler is actually high in the problematic situation? I'm asking because I highly doubt that this is the case. We have already established that at least the websocket server does not cause much CPU load. The same is probably true for the scheduler. I also doubt that either causes a considerable amount of I/O load. Maybe the worker processes of PostgreSQL cause a high I/O load (or even CPU load) instead (which would make it hard to pin this load down to the scheduling problem). Maybe none of the processes cause a high load at all because locks held by some transaction are the bottleneck.
I would probably approach this by trying to solve the actual problem first, which means provoking it in some way locally with the help of some servers with enough RAM¹. Then I'd closely observe the resource usage locally and what exactly causes it. Depending on the findings, fixing the problem itself might be simpler than adding a dynamic limit for the case that the problem occurs². So adding a dynamic limit might not make sense, and adding it blindly without knowing what to look for even less.
So I guess by trying to fight the symptoms first we're approaching the problem from the wrong angle.
¹ When working on this before I noticed that with "only" 32 GiB RAM the number of worker slots I could start on my laptop was quite limited.
² Adding a dynamic limit isn't that trivial.
Updated by okurz 3 months ago
mkittler wrote in #note-8:
I would probably approach this by trying to solve the actual problem first - which means provoking it in some way locally with the help of some servers with enough RAM¹. Then I'd closely observe the resource usage locally and what causes this exactly.
I agree. That is how I interpret the suggestion "Come up with a definition of the critical websocket+scheduler load based on 'overload experiments'".
Updated by mkittler about 2 months ago · Edited
- Assignee set to mkittler
Solving the problem by removing a bottleneck or problematic behavior in general is not going to bring us to any kind of definition of load. If it is possible to solve this without adding a limit, I would also avoid adding one (because the additional monitoring needed to enforce a limit would add complexity).
I can still move this ticket forward to some extent even though the ACs might not turn out to be useful in the end.
EDIT: I will use sapworker2-sp.qe.nue2.suse.org, which is powered off anyway, to create many worker instances.
Updated by mkittler about 2 months ago
- Description updated (diff)
- Status changed from Workable to In Progress
Updated by livdywan about 2 months ago
- Status changed from In Progress to Workable
Updated by livdywan about 2 months ago
Maybe we can discuss this in the unblock? I think we talked about it but nobody added notes here.
Updated by mkittler about 1 month ago
- Status changed from Workable to Feedback
My change https://github.com/os-autoinst/openQA/pull/6358 hasn't received much review yet. I think we can merge it even though it might not be all that's needed; it is definitely a step in the right direction. It would allow us to increase the online worker limit in production to e.g. 1500 and see whether it scales better now. If not, we can still go back to 900 and see what else can be improved. When testing it locally (with 1500 workers from sapworker2-sp.qe.nue2.suse.org) this PR was definitely an improvement.
Updated by mkittler about 1 month ago
The PR was merged 4 days ago and hasn't caused any problems on o3 so far.
When OSD is recovered and the situation has settled down we could try raising the worker limit there. Maybe my changes so far are already enough.
If I have ideas for other improvements I can come up with another PR but so far I don't have a clear idea.
Updated by mkittler 24 days ago
To test this I would increase misc_limits/max_online_workers from 960 to 1500 on OSD. We currently have 736 workers online, so we need around 300 additional slots to get a bit over 1000. Bringing back workers 37 to 40 would only add 200 slots, so maybe I'll start a few additional worker slots beyond what we normally configure on those hosts.
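For reference, a minimal sketch of that change, assuming max_online_workers lives under the [misc_limits] section of the web UI's openqa.ini as the key path above suggests:

```ini
# /etc/openqa/openqa.ini on OSD (file path and section assumed from the key named above)
[misc_limits]
# raise the cap on concurrently connected online workers from 960 for the experiment
max_online_workers = 1500
```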
Updated by mkittler 19 days ago
- Status changed from Workable to Feedback
Status from Wednesday evening:
It looks like the additional slots I created on worker40 are not running anymore. Probably Salt took care of disabling those slots again; I'll keep them disabled. Since we're now only at 928 worker slots, the experiment has kind of terminated itself, so I don't expect any problems to occur.
However, until the experiment was terminated (supposedly by Salt) the situation looked good with over 1000 worker slots temporarily (for a few hours). New jobs were still assigned and executed, so I would still count the experiment as a success. Probably bumping the connection limit for websocket connections is really what was needed, and the change to avoid too frequent status updates might have helped a little bit, too.
All additional worker hosts are still online, by the way. Only the additional worker slots I created on worker40 were stopped. Should I shut the hosts down again? I noticed that there is actually no real need for so many worker slots, so to save power it would make sense to shut them down again. The only downside is that those workers will lag behind when it comes to updates and other changes (which also caused a few problems when I brought them back online last week).
Updated by mkittler 19 days ago
- Description updated (diff)
- Status changed from Feedback to Workable
I created #181784 and will unassign from this ticket. We might want to reconsider whether we want to keep it in ready.
I masked worker slots on workers 37 to 40 again via sudo systemctl mask --now $(openqa-worker-services --masking), powered them off and removed them from Salt via for h in worker{37..40}.oqa.prg2.suse.org; do sudo salt-key -d $h; done.
sapworker2.qe.nue2.suse.org is also powered down.