action #160478 (closed)

openQA Project - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project - coordination #108209: [epic] Reduce load on OSD

Try out higher global openQA job limit on OSD again after switch to nginx size:S

Added by okurz 2 months ago. Updated about 1 month ago.

Status: Resolved
Priority: Low
Assignee:
Category: Feature requests
Target version:
Start date: 2023-08-31
Due date:
% Done: 0%
Estimated time:

Description

Motivation

We switched OSD to nginx because even with the lower global openQA job limit defined in #134927 we could not prevent unresponsiveness of OSD. So far we have not reproduced any unresponsiveness with nginx in place. Now we could try out the effect of a higher global job limit again.

Acceptance criteria

  • AC1: The global openQA job limit is as high as possible while still not causing a higher chance of unresponsiveness

Suggestions


Related issues 1 (0 open, 1 closed)

Copied from openQA Infrastructure - action #134927: OSD throws 503, unresponsive for some minutes size:M (Resolved, okurz, 2023-08-31)

Actions #1

Updated by okurz 2 months ago

  • Copied from action #134927: OSD throws 503, unresponsive for some minutes size:M added
Actions #2

Updated by okurz 2 months ago

  • Parent task set to #108209
Actions #3

Updated by okurz 2 months ago

I changed the limit from 340 to 600 now and called systemctl restart openqa-{webui,scheduler,websockets}; monitoring the impact.
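
For reference, a minimal sketch of such a change, assuming the global limit is the max_running_jobs setting in the [scheduler] section of /etc/openqa/openqa.ini (setting name and file path are assumptions, not stated in this ticket):

    # Sketch under the above assumptions: raise the global limit of concurrently
    # running jobs from 340 to 600 in openqa.ini on OSD.
    sudo sed -i 's/^max_running_jobs = 340/max_running_jobs = 600/' /etc/openqa/openqa.ini

    # Restart the affected services so the scheduler picks up the new limit.
    sudo systemctl restart openqa-{webui,scheduler,websockets}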

Actions #4

Updated by okurz 2 months ago · Edited

As visible on https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1715946732089&to=1715966651540, about 4h after I increased the job limit the system load increased and the CPU usage maxed out at 100% for a longer time. There are also peaks in the HTTP response time exceeding 4s, although not for long, and no "completely unresponsive" periods so far. There also seem to be more "broken workers". I assume we should reduce a bit again. Going to 420.

Actions #5

Updated by okurz about 2 months ago

  • Status changed from In Progress to Feedback

So far 420 still seems good and does not increase the system load by much. We have still encountered #159396. I also see that for bigger schedules of openQA tests a longer queue of Minion jobs piles up, but eventually they are worked on.
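
A minimal sketch for checking on that queue, assuming shell access to OSD, that openQA runs as the geekotest user and that the main openQA script is at /usr/share/openqa/script/openqa (user and path are assumptions); the Minion dashboard in the web UI shows the same information:

    # Show overall Minion statistics, including the number of inactive (queued) jobs.
    sudo -u geekotest /usr/share/openqa/script/openqa minion job -s

    # List jobs that are still waiting to be picked up.
    sudo -u geekotest /usr/share/openqa/script/openqa minion job -S inactive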

Actions #6

Updated by livdywan about 2 months ago

  • Subject changed from Try out higher global openQA job limit on OSD again after switch to nginx to Try out higher global openQA job limit on OSD again after switch to nginx size:s
Actions #7

Updated by okurz about 1 month ago

  • Subject changed from Try out higher global openQA job limit on OSD again after switch to nginx size:s to Try out higher global openQA job limit on OSD again after switch to nginx size:S
Actions #8

Updated by okurz about 1 month ago

  • Status changed from Feedback to Resolved

So all good so far. I feel like we can't go further for now.
