action #168178

Updated by mkittler 1 day ago

## Motivation 
 With #157690, the number of connected online workers is already limited by a configuration variable. We can extend this to also limit the number based on the actual websocket+scheduler load, keeping it low enough that the websocket server and scheduler operate properly and problems like #157666 are avoided. 

 ## Acceptance criteria 
 * **AC1:** A clear definition of "websocket+scheduler load" exists 
 * **AC2:** The number of online workers is limited to `min(configured_number,configured_load_limit)` 
 * **AC3:** openQA workers that are rejected for exceeding the mentioned limit(s) explicitly log that situation or fail 
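
 A minimal sketch of how AC2 and AC3 could fit together, assuming the load is available as a single number. It is written in Python for illustration only; it is not the openQA implementation, and the names `Limits`, `max_online_workers`, `load_limit` and `check_admission` are hypothetical. It reads AC2 as "both limits must hold": a worker is accepted only while the online count is below the configured number and the measured load is below the configured load limit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Limits:
    """Hypothetical configuration values; the names are not taken from openQA."""
    max_online_workers: int  # the simple count limit (cf. #157690)
    load_limit: float        # threshold for the yet-to-be-defined load metric (AC1)


def check_admission(online_workers: int, current_load: float, limits: Limits) -> Optional[str]:
    """Return None if a new worker may connect, otherwise an explicit rejection reason (AC3)."""
    if online_workers >= limits.max_online_workers:
        return (f"rejected: {online_workers} workers online, "
                f"configured limit is {limits.max_online_workers}")
    if current_load >= limits.load_limit:
        return (f"rejected: load {current_load:.2f} reached "
                f"configured load limit {limits.load_limit:.2f}")
    return None


limits = Limits(max_online_workers=2, load_limit=0.8)
for online, load in [(1, 0.5), (2, 0.5), (1, 0.9)]:
    reason = check_admission(online, load, limits)
    print("accepted" if reason is None else reason)
```

 Returning the reason as a string keeps the decision and the message in one place, so the caller can both log it and send it back to the rejected worker.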

 ## Suggestions 
 * Look into the implementation of #157690 to see how the simple limit was implemented so far 
 * Come up with a definition of the critical websocket+scheduler load based on "overload experiments" which can be used as a metric for the problem seen in #157666 
 * Extend the simple limit with a lookup of the said metric and also prevent additional worker connections based on the metric 
 * Also consider disconnecting already connected workers if the metric exceeds the configured threshold 
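
 The last suggestion could be sketched as follows. This is an illustration under stated assumptions, not a proposal for the actual code: `shed_workers`, `measure_load` and `disconnect` are hypothetical names, the metric is assumed to be re-measurable after each disconnect, and dropping the most recently connected worker first is an arbitrary policy chosen for the example.

```python
from typing import Callable, List


def shed_workers(
    connected: List[str],
    measure_load: Callable[[], float],
    load_limit: float,
    disconnect: Callable[[str], None],
) -> List[str]:
    """Disconnect workers, newest first, until the load no longer exceeds the limit.

    `connected` is ordered by connection time (oldest first) and is modified in place.
    Returns the workers that were disconnected, in the order they were dropped.
    """
    shed = []
    while connected and measure_load() > load_limit:
        worker = connected.pop()  # most recently connected worker
        disconnect(worker)
        shed.append(worker)
    return shed


# Usage with a stand-in metric that grows linearly with the number of workers.
connected = ["w1", "w2", "w3", "w4"]
dropped = []
shed = shed_workers(connected, lambda: len(connected) * 0.3, 0.8, dropped.append)
print(shed, connected)
```

 In a real setup the load would react with some delay, so a loop like this would likely need a pause or a hysteresis band between iterations to avoid disconnecting more workers than necessary.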

 ## Rollback steps 
 * Ensure `sapworker2.qe.nue2.suse.org` is powered down again, as it is/was used to create many workers while working on this ticket.
