action #158910
coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
coordination #158110: [epic] Prevent worker overload
typing issue on ppc64 worker - reconsider number of worker instances in particular on ppc64le kvm tests size:M
0%
Description
Motivation
We observed that e.g. petrol is running 8 openQA worker instances, which with 16 logical cores leaves no headroom for the system. We probably need to lower that number.
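To make the headroom argument concrete, here is a minimal sketch of the arithmetic. It assumes that each worker instance keeps roughly 2 logical cores busy (which is what "8 instances on 16 cores leaves no headroom" implies) and that some cores should stay reserved for the host itself. The function name and both defaults are illustrative assumptions, not measured values or anything taken from the openQA configuration.

```python
def max_worker_instances(logical_cores: int,
                         cores_per_instance: int = 2,
                         reserved_cores: int = 2) -> int:
    """Return how many worker instances fit while keeping cores free for the host.

    cores_per_instance and reserved_cores are assumptions for illustration;
    the real per-instance demand depends on the tests being run.
    """
    if logical_cores < 0 or reserved_cores < 0 or cores_per_instance <= 0:
        raise ValueError("core counts must be non-negative and cores_per_instance positive")
    usable = max(logical_cores - reserved_cores, 0)
    return usable // cores_per_instance


# petrol: 16 logical cores, no reservation -> the current 8 instances
print(max_worker_instances(16, reserved_cores=0))
# petrol: 16 logical cores, 2 reserved for the system -> 7 instances
print(max_worker_instances(16))
```

Under these assumptions, reserving even 2 cores for the system already means dropping from 8 to 7 instances.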
Acceptance criteria
- AC1: diesel+petrol+mania consistently do not alert on too high system load
- AC2: diesel+petrol+mania use an efficient number of instances, i.e. the highest number possible
Suggestions
- Try a lower load limit threshold
- Reduce number of instances
- Monitor
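The load limit idea from the suggestions can be sketched as a simple gate: only start a new job while the current system load is below a configured threshold. This is a hypothetical illustration using Python's `os.getloadavg()`, not the actual openQA worker implementation (see #158125 for that); the function name and the use of the 1-minute average are assumptions.

```python
import os


def may_start_new_job(load_limit: float, load_avg=None) -> bool:
    """Return True if the 1-minute load average is below load_limit.

    load_avg can be passed as a (load1, load5, load15) tuple, e.g. for
    experiments; by default the current system load is read.
    """
    if load_avg is None:
        load_avg = os.getloadavg()  # (load1, load5, load15), Unix only
    return load_avg[0] < load_limit


# Hypothetical usage with a load limit of 10
if may_start_new_job(10):
    print("load below limit, worker may pick up a job")
else:
    print("load too high, worker should wait")
```

A lower limit makes the worker back off earlier, at the cost of instances idling more often, so the limit and the number of instances need to be tuned together.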
Updated by okurz 7 months ago
- Copied from action #158125: typing issue on ppc64 worker - only pick up (or start) new jobs if CPU load is below configured threshold size:M added
Updated by okurz 7 months ago · Edited
- Due date changed from 2024-04-26 to 2024-06-07
- Priority changed from Normal to Low
- Target version changed from Ready to Tools - Next
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/780 (merged) also enables the lower load limit on diesel+mania. Depending on the results, we may then be able to increase the number of worker instances on both petrol and mania. diesel is still running with the original 8 instances.
Updated by okurz 6 months ago · Edited
- File Screenshot_20240421_184442_system_load_mania_load_limit_10_in_action.png added
It looks like the load limit is rather effective, keeping the load mostly below 10, as visible on the right-hand side.
Using instance numbers as before:
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/787 (merged)
After that, back to monitoring. This ticket should then block on #158116.
Updated by okurz 6 months ago
https://monitor.qa.suse.de/d/WDpetrol/worker-dashboard-petrol?orgId=1&from=1714095329948&to=1714174133695 showed an alert about too high system load on petrol, with load15 > 100 for a period of 45m. I think we should reduce the instances again on petrol+diesel by 1.
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/797
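For the monitoring step, the observation above (load15 above 100 for 45m) can be checked against exported load samples with a small sketch like the following. This is not the actual alert rule from the dashboard; the function name, the sample format, and the 5-minute sample interval are assumptions for illustration.

```python
from datetime import datetime, timedelta


def sustained_overload(samples, threshold, min_duration):
    """Return True if load15 stayed above threshold for at least min_duration.

    samples: list of (timestamp, load15) tuples sorted by timestamp.
    Any sample at or below the threshold resets the measured period.
    """
    start = None
    for timestamp, load15 in samples:
        if load15 > threshold:
            if start is None:
                start = timestamp
            if timestamp - start >= min_duration:
                return True
        else:
            start = None
    return False


# Synthetic example data: one sample every 5 minutes, all above 100
t0 = datetime(2024, 4, 26, 12, 0)
samples = [(t0 + timedelta(minutes=5 * i), 120.0) for i in range(10)]
print(sustained_overload(samples, threshold=100, min_duration=timedelta(minutes=45)))
```

Running the same check before and after reducing the instance count gives a simple yes/no answer for AC1 over a chosen time range.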