coordination #158110
coordination #110833 (open): [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
[epic] Prevent worker overload
Start date:
2024-03-27
Due date:
% Done:
57%
Estimated time:
(Total: 0.00 h)
Description
Motivation
For a long time, overloaded openQA workers have caused "typing issues" and other random problems that lead to annoying sporadic test failures. So far this can only be avoided by carefully configuring openQA worker machines so that they are not overloaded, which is normally done by running just a limited number of worker instances, sized for the case where all worker instances work on rather resource-heavy jobs at the same time. As a consequence, in most cases openQA hardware can be severely underused. To make more efficient use of hardware resources while keeping openQA jobs as stable as possible, openQA itself must ensure that resources are not exhausted.
Ideas
- Only pick up (or start) new jobs if CPU load is below configured threshold -> #158125
- Overload openQA systems on purpose to find out which system parameters are critical, then define a feature request for each relevant system parameter that openQA should handle automatically, e.g. check CPU load, available memory, I/O rate, storage space, etc.:
  - before starting jobs
  - while running jobs
  - after jobs failed
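
The first idea above, together with the "before starting jobs" check, could look roughly like the following. This is only a minimal sketch in Python to illustrate the intent, not openQA's actual worker code: the `Limits` class, its field names and `overload_reasons` are hypothetical, and only CPU load and free storage space are covered.

```python
import os
import shutil
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Limits:
    """Hypothetical per-worker-host thresholds (not real openQA settings)."""
    max_load: float       # highest acceptable 1-minute load average
    min_free_bytes: int   # lowest acceptable free space on the pool path


def overload_reasons(
    limits: Limits,
    load: Optional[float] = None,
    free_bytes: Optional[int] = None,
    path: str = "/",
) -> List[str]:
    """Return the reasons why no new job should be started (empty if none).

    `load` and `free_bytes` can be passed in explicitly, which keeps the
    check testable; otherwise they are read from the running system.
    """
    if load is None:
        load = os.getloadavg()[0]  # 1-minute load average (Unix only)
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free

    reasons = []
    if load >= limits.max_load:
        reasons.append(f"load {load:.2f} is not below {limits.max_load:.2f}")
    if free_bytes < limits.min_free_bytes:
        reasons.append(f"only {free_bytes} bytes free, need {limits.min_free_bytes}")
    return reasons


def may_start_job(limits: Limits, **measurements) -> bool:
    """A worker would only pick up a new job if no limit is exceeded."""
    return not overload_reasons(limits, **measurements)
```

Returning the list of reasons rather than a bare boolean would let a worker report why it is currently not accepting jobs. The same function could also be called while jobs are running or after a failure to record whether the host was overloaded at that point.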