action #160478
openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances (closed)
openQA Project (public) - coordination #108209: [epic] Reduce load on OSD
Try out higher global openQA job limit on OSD again after switch to nginx size:S
Description
Motivation
We switched OSD to nginx because even with the lower global openQA job limit defined in #134927 we could not prevent unresponsiveness of OSD. So far we have not reproduced any unresponsiveness with nginx in place, so now we can try out the effect of a higher global job limit again.
Acceptance criteria
- AC1: The global openQA job limit is as high as possible while still not causing a higher chance of unresponsiveness
Suggestions
- Try out higher limits while closely monitoring the job queues on https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test as well as the load and responsiveness on OSD on https://monitor.qa.suse.de/d/WebuiDb/webui-summary
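For reference, a minimal sketch of where such a limit would be adjusted. This assumes the global limit is the max_running_jobs setting in the [scheduler] section of /etc/openqa/openqa.ini; on OSD the file is managed via salt, so the value would be changed in the salt pillars rather than edited by hand:

```ini
# /etc/openqa/openqa.ini (sketch; setting name and section are an assumption)
[scheduler]
# Global limit on concurrently running openQA jobs across all workers.
# Raise it step by step while watching the dashboards linked above,
# e.g. moving from the previous limit towards a higher value.
max_running_jobs = 420
```

Presumably the webUI/scheduler services need a restart to pick up the change; comparing the load and response-time graphs for a few hours after each step should show whether the new limit is still safe.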
Updated by okurz 7 months ago
- Copied from action #134927: OSD throws 503, unresponsive for some minutes size:M added
Updated by okurz 7 months ago · Edited
As visible on https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1715946732089&to=1715966651540, about 4h after I increased the job limit the system load went up and the CPU usage maxed out at 100% for an extended period. There are also peaks in the HTTP response time exceeding 4s, although only briefly and with no completely unresponsive periods so far. There also seem to be more "broken" workers. I assume we should reduce a bit again. Going to 420.
Updated by okurz 7 months ago
- Status changed from In Progress to Feedback
So far 420 still seems good in that it does not increase the system load by much. We have still encountered #159396. I also see that for bigger schedules of openQA tests a longer queue of Minion jobs piles up, but eventually they are worked on.