action #158125

Updated by okurz about 1 month ago

## Motivation 
 In #158104 we observed typing issues due to mania being overloaded. mania was configured to run 30 openQA worker instances and that was mostly fine, as proven in #139271-24. The recent overload was likely triggered by enabling video again as part of #157636. I already reduced the number of worker instances, but this has the drawback that the long test backlog again takes longer to finish. We should be more flexible in using available resources. Here I suggest implementing a check in the worker so that it only picks up new jobs if the CPU load is below a configured threshold. 

 ## Acceptance criteria 
 * **AC1:** An openQA worker does not start an openQA job if the CPU load is higher than the configured threshold 
 * **AC2:** By default the worker still picks up jobs if the load is not too high 

 ## Suggestions 
 * Possibly the worker code somewhere in https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Worker.pm#L472 can be extended to check the CPU load on the openQA worker host, e.g. "load15" on GNU/Linux, and to skip picking up any next job if the load exceeds a (configurable) threshold 
 * Or the openQA worker then decides not to advertise itself at all, i.e. it does not connect to, or it disconnects from, the webUI instance 
 * Add a sensible disabled default value in https://github.com/os-autoinst/openQA/blob/master/etc/openqa/workers.ini with an explanatory comment 
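
 The first suggestion could be sketched roughly as follows. This is only an illustration in Python, not the worker's actual code (the file linked above is a Perl module). The function names and the `threshold` parameter are hypothetical, and `None` stands in for the "disabled" default suggested for workers.ini.

```python
import os


def parse_loadavg(text):
    """Parse /proc/loadavg-style content into (load1, load5, load15)."""
    fields = text.split()
    return float(fields[0]), float(fields[1]), float(fields[2])


def current_load15():
    """Return the 15-minute load average of this host (Unix only)."""
    # os.getloadavg() returns the 1, 5 and 15 minute load averages.
    return os.getloadavg()[2]


def should_accept_job(load15, threshold=None):
    """Decide whether the worker may pick up a new job.

    threshold is a hypothetical setting; None means the check is
    disabled, so the worker behaves as it does today (AC2).
    """
    if threshold is None:
        return True
    # AC1: do not start a job if the load is higher than the threshold.
    return load15 <= threshold
```

 A worker loop would call something like `should_accept_job(current_load15(), threshold)` before accepting a job. Using the 15-minute average rather than the 1-minute one avoids reacting to short spikes, at the cost of reacting more slowly to sustained overload.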

 ## Out of scope 
 * Considering the existing Grafana monitoring for "broken workers", in case we use the feature of declaring workers as "broken" due to too high CPU load
