action #168178
coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
coordination #157669: websockets+scheduler improvements to support more online worker instances
Limit connected online workers based on websocket+scheduler load size:M
Start date:
Due date:
% Done: 0%
Estimated time:
Description
Motivation
With #157690 the number of connected online workers is already limited based on a configuration variable. We can extend that to also limit based on the actual websocket+scheduler load, that is, keep the number low enough to ensure proper operation of websocket+scheduler and prevent problems like #157666.
Acceptance criteria
- AC1: A clear definition of "websocket+scheduler load" exists
- AC2: The number of online workers is limited to min(configured_number,configured_load_limit)
- AC3: openQA workers rejected for exceeding the mentioned limit(s) explicitly log that situation or fail
Suggestions
- Look into the implementation of #157690 to see how the simple limit was implemented so far
- Come up with a definition of the critical websocket+scheduler load based on "overload experiments" which can be used as a metric for the problem seen in #157666
- Extend the simple limit with a lookup of the said metric and also prevent additional worker connections based on the metric
- Also consider disconnecting already connected workers if the metric exceeds the configured threshold
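The suggestions above could be sketched roughly as follows. This is only an illustration in Python, not openQA code: the load metric is still undefined (see AC1), so `current_load` is a placeholder number, and `LimitConfig`, `may_connect`, `workers_to_disconnect` and all setting names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class LimitConfig:
    # Hypothetical settings; the names are not taken from openQA.
    max_online_workers: int  # simple count limit in the spirit of #157690
    load_threshold: float    # critical value of the yet-to-be-defined load metric


def may_connect(online_workers: int, current_load: float, cfg: LimitConfig) -> Tuple[bool, str]:
    """Decide whether one more worker may connect; returns (allowed, reason)."""
    if online_workers >= cfg.max_online_workers:
        return False, f"worker limit reached ({online_workers}/{cfg.max_online_workers})"
    if current_load >= cfg.load_threshold:
        return False, f"load {current_load:.2f} at or above threshold {cfg.load_threshold:.2f}"
    return True, "ok"


def workers_to_disconnect(online_workers: int, current_load: float,
                          cfg: LimitConfig, step: int = 1) -> int:
    """How many already connected workers to drop while the load exceeds the threshold."""
    if current_load <= cfg.load_threshold:
        return 0
    # Shed at most `step` workers per check so the load can settle in between.
    return min(step, online_workers)
```

Returning a reason string alongside the decision is one way to approach AC3, since the rejecting side can log it and pass it on to the rejected worker. Shedding workers in small steps is an assumption made here to avoid disconnecting many workers at once on a short load spike.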
Updated by okurz about 1 month ago
- Copied from action #157690: Simple global limit of registered/online workers size:M added
Updated by okurz about 1 month ago
- Target version changed from Ready to Tools - Next
Updated by okurz about 1 month ago
- Target version changed from Tools - Next to Ready
Updated by okurz about 1 month ago
- Subject changed from Limit connected online workers based on websocket+scheduler load to Limit connected online workers based on websocket+scheduler load size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz about 1 month ago
- Related to action #166802: Recover worker37, worker38, worker39 size:S added