action #160478 (closed)

openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project (public) - coordination #108209: [epic] Reduce load on OSD

Try out higher global openQA job limit on OSD again after switch to nginx size:S

Added by okurz 7 months ago. Updated 6 months ago.

Status: Resolved
Priority: Low
Assignee:
Category: Feature requests
Start date: 2023-08-31
Due date:
% Done: 0%
Estimated time:

Description

Motivation

We switched OSD to nginx because even with the lower global openQA job limit defined in #134927 we could not prevent OSD from becoming unresponsive. So far we have not reproduced any unresponsiveness with nginx in place. Now we can try out the effect of a higher global job limit again.

Acceptance criteria

  • AC1: The global openQA job limit is as high as possible while still not causing a higher chance of unresponsiveness

Suggestions


Related issues 1 (0 open, 1 closed)

Copied from openQA Infrastructure (public) - action #134927: OSD throws 503, unresponsive for some minutes size:M (Resolved, okurz, 2023-08-31)

Actions #1

Updated by okurz 7 months ago

  • Copied from action #134927: OSD throws 503, unresponsive for some minutes size:M added
Actions #2

Updated by okurz 7 months ago

  • Parent task set to #108209
Actions #3

Updated by okurz 7 months ago

I changed the limit from 340 to 600 now and called systemctl restart openqa-{webui,scheduler,websockets}; monitoring the impact.
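
A minimal sketch of what that change looks like, assuming the global limit is the max_running_jobs setting in the [scheduler] section of /etc/openqa/openqa.ini (setting name and file location are assumptions; the ticket only mentions the values and the restart):

    # Assumed edit in /etc/openqa/openqa.ini on the web UI host:
    #   [scheduler]
    #   max_running_jobs = 600   # previously 340
    # Then restart the affected services so the new limit takes effect:
    sudo systemctl restart openqa-{webui,scheduler,websockets}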

Actions #4

Updated by okurz 7 months ago · Edited

As visible on https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1715946732089&to=1715966651540, about 4h after I increased the job limit the system load increased and the CPU usage maxed out at 100% for an extended period. There are also peaks in the HTTP response time exceeding 4s, although not for long and with no "completely unresponsive" periods so far. There also seem to be more "broken workers". I assume we should reduce a bit again. Going down to 420.
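
For a rough external spot-check of those response-time peaks, a simple probe like the following can complement the Grafana panels (URL, route and interval are illustrative assumptions):

    # log the total HTTPS response time of the openQA web UI every 10 seconds
    while true; do
      t=$(curl -o /dev/null -s -w '%{time_total}' https://openqa.suse.de/tests)
      echo "$(date -Is) ${t}s"
      sleep 10
    done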

Actions #5

Updated by okurz 7 months ago

  • Status changed from In Progress to Feedback

So far 420 still seems fine and does not push the system load up by much. We have still encountered #159396. I also see that for bigger schedules of openQA tests a longer queue of minion jobs piles up, but they are eventually worked on.
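
One way to watch that Minion queue from the command line, assuming shell access on OSD and the usual packaged script path and service user (both are assumptions):

    # print Minion statistics, including inactive (queued) and active jobs
    sudo -u geekotest /usr/share/openqa/script/openqa minion job -s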

Actions #6

Updated by livdywan 7 months ago

  • Subject changed from Try out higher global openQA job limit on OSD again after switch to nginx to Try out higher global openQA job limit on OSD again after switch to nginx size:s
Actions #7

Updated by okurz 7 months ago

  • Subject changed from Try out higher global openQA job limit on OSD again after switch to nginx size:s to Try out higher global openQA job limit on OSD again after switch to nginx size:S
Actions #8

Updated by okurz 6 months ago

  • Status changed from Feedback to Resolved

So all good so far. I feel like we can't go further for now.
