action #168244

closed

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

coordination #158110: [epic] Prevent worker overload

reconsider load calculation for worker load limit especially for ppc size:S

Added by okurz 2 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Low
Assignee:
Category: Feature requests
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Motivation

With #158125 we have a worker load limit, which helps, but there can still be cases like the one on 2024-10-14 on mania, where the load went far above the configured load limit. Regarding load15 and the other load averages, I was thinking of looking at the combination of load values, e.g. only start jobs if max(load1, load5, load15) < load_limit.

Acceptance criteria

  • AC1: ppc workers consistently do not alert about too high load
  • AC2: ppc worker instance numbers are unchanged
  • AC3: Other architectures use the same algorithm

Suggestions

  • Looking only at load15 has the problem that if many jobs start within a short time, the load is not yet high, so the load limit is not always effective. If we used max(load1, load5, load15) < load_limit, then load1 or load5 might already be higher.
  • As an alternative, only start jobs if max(load1, load5, load15) < load_limit || (load1 < load_limit && load1 < load5 && load5 < load15). The first part of the condition prevents overload when jobs are picked up within seconds or minutes of one another. The second part allows jobs to be picked up while the load is declining. This way we can set the load limit lower without forcing the worker to be idle for too long.
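
A minimal sketch of the second suggested condition, written in Python purely for illustration. It is not the change that was merged for this ticket; the function names may_start_job and may_start_job_now are hypothetical, and reading the load averages via os.getloadavg() is an assumption about how a worker could obtain them on a Unix-like host.

```python
import os


def may_start_job(load1, load5, load15, load_limit):
    """Return True if a new job may be started under the suggested rule.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Part 1: all three load averages are below the limit. This guards
    # against bursts, because load1 reacts quickly when many jobs start
    # within a short time.
    all_below_limit = max(load1, load5, load15) < load_limit
    # Part 2: the most recent load is below the limit and the load is
    # declining (load1 < load5 < load15), so a job may be picked up even
    # though the slower averages are still above the limit.
    declining = load1 < load_limit and load1 < load5 < load15
    return all_below_limit or declining


def may_start_job_now(load_limit):
    """Apply the rule to the current system load (Unix-like hosts only)."""
    load1, load5, load15 = os.getloadavg()
    return may_start_job(load1, load5, load15, load_limit)
```

For example, with load_limit = 10, the values (load1, load5, load15) = (8, 11, 14) allow a new job because the load is declining, whereas (12, 5, 3) do not because load1 already exceeds the limit.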

Related issues 1 (0 open, 1 closed)

Copied from openQA Project (public) - action #158125: typing issue on ppc64 worker - only pick up (or start) new jobs if CPU load is below configured threshold size:M (Resolved, mkittler)

Actions #1

Updated by okurz 2 months ago

  • Copied from action #158125: typing issue on ppc64 worker - only pick up (or start) new jobs if CPU load is below configured threshold size:M added
Actions #2

Updated by okurz 2 months ago

  • Description updated (diff)
  • Status changed from New to In Progress
  • Assignee set to okurz
  • Target version changed from future to Ready
Actions #4

Updated by openqa_review 2 months ago

  • Due date set to 2024-10-30

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by okurz 2 months ago

  • Status changed from In Progress to Workable
Actions #6

Updated by okurz 2 months ago

  • Status changed from Workable to Feedback
Actions #7

Updated by okurz 2 months ago

  • Subject changed from reconsider load calculation for worker load limit to reconsider load calculation for worker load limit especially for ppc size:S
  • Description updated (diff)
Actions #8

Updated by okurz about 2 months ago

  • Status changed from Feedback to In Progress
Actions #9

Updated by okurz about 2 months ago

  • Status changed from In Progress to Feedback
Actions #10

Updated by okurz about 2 months ago

Merged. Waiting for the change to be deployed to OSD workers, then monitoring results the next time we have a bigger test queue.

Actions #11

Updated by okurz about 2 months ago

  • Priority changed from Normal to Low
Actions #12

Updated by okurz about 2 months ago

Deployed on OSD, trying on mania:

openqa-clone-job --repeat 400 --within-instance https://openqa.suse.de/tests/15747266 {TEST,BUILD}+=-poo168244-okurz _GROUP=0 WORKER_CLASS+=,mania

https://openqa.suse.de/tests/overview?version=15-SP7&distri=sle&build=32.2-poo168244-okurz

Actions #13

Updated by okurz about 2 months ago

  • Due date deleted (2024-10-30)
  • Status changed from Feedback to Resolved

I have two screenshots but can't upload them. However, the load as shown on https://stats.openqa-monitor.qa.suse.de/d/WDmania/worker-dashboard-mania?viewPanel=54694&orgId=1&from=now-7d&to=now looks sane and better now, with no overshoot.
