action #158910

open

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

coordination #158110: [epic] Prevent worker overload

typing issue on ppc64 worker - reconsider number of worker instances in particular on ppc64le kvm tests size:M

Added by okurz 17 days ago. Updated about 8 hours ago.

Status: Feedback
Priority: Low
Assignee:
Category: Feature requests
Target version:
Start date:
Due date: 2024-06-07 (Due in 39 days)
% Done: 0%
Estimated time:
Tags:

Description

Motivation

We observed that e.g. petrol is running 8 openQA worker instances, which with 16 logical cores leaves no headroom for the system. We probably need to lower that number.

Acceptance criteria

  • AC1: diesel+petrol+mania consistently do not alert on too high system load
  • AC2: diesel+petrol+mania use an efficient number of instances, i.e. the highest number possible

Suggestions

  • Try a lower load limit threshold
  • Reduce number of instances
  • Monitor
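The first suggestion can be illustrated with a minimal sketch, assuming a worker only accepts new jobs while the system load average is below a configured limit. The function name `may_accept_job`, the choice to compare the highest of the three load averages, and the example limit of 10 are assumptions for illustration, not openQA's actual implementation.

```python
import os


def may_accept_job(load_limit, loadavg=None):
    """Return True if the system load is below load_limit.

    loadavg is a (load1, load5, load15) tuple; if omitted it is read
    from the running system. Comparing the highest of the three values
    is an assumption made for this sketch.
    """
    if loadavg is None:
        loadavg = os.getloadavg()  # available on Unix-like systems
    return max(loadavg) < load_limit


if __name__ == "__main__":
    # Hypothetical limit of 10 for demonstration purposes
    print("accept new job:", may_accept_job(10))
```

A lower limit makes the worker more conservative about picking up jobs, which trades throughput for headroom; the number of instances then only sets an upper bound on parallelism.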

Related issues 1 (0 open, 1 closed)

Copied from openQA Project - action #158125: typing issue on ppc64 worker - only pick up (or start) new jobs if CPU load is below configured threshold size:M (Resolved, mkittler)

Actions #1

Updated by okurz 17 days ago

  • Copied from action #158125: typing issue on ppc64 worker - only pick up (or start) new jobs if CPU load is below configured threshold size:M added
Actions #2

Updated by okurz 17 days ago · Edited

  • Due date set to 2024-04-26
  • Status changed from In Progress to Feedback
Actions #3

Updated by okurz 11 days ago · Edited

  • Due date changed from 2024-04-26 to 2024-06-07
  • Priority changed from Normal to Low
  • Target version changed from Ready to Tools - Next

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/780 (merged) also applies the lower load limit on diesel+mania. Depending on the results, we may then be able to increase the worker instances on both petrol and mania. diesel is still running with the original 8 instances.

Actions #4

Updated by okurz 8 days ago · Edited

It looks like the load limit is rather effective at keeping the load mostly below 10, as visible on the right-hand side of Screenshot_20240421_184442_system_load_mania_load_limit_10_in_action.png.

Using the same instance numbers as before:
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/787 (merged)

After that, back to monitoring. This ticket should then block on #158116.

Actions #5

Updated by jbaier_cz 3 days ago

  • Subject changed from typing issue on ppc64 worker - reconsider number of worker instances in particular on ppc64le kvm tests to typing issue on ppc64 worker - reconsider number of worker instances in particular on ppc64le kvm tests size:M
  • Description updated (diff)
Actions #6

Updated by okurz about 8 hours ago

https://monitor.qa.suse.de/d/WDpetrol/worker-dashboard-petrol?orgId=1&from=1714095329948&to=1714174133695 showed an alert about too high system load on petrol, with load15 reaching values > 100 for a period of 45 minutes. I think we should reduce the instances again on petrol+diesel by 1.

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/797
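
The reasoning in this comment could be sketched as follows: if load15 stays above an alert threshold for a sustained period, lower the instance count by one. The helper names, the sample format and the minimum of one instance are hypothetical; the real alert lives in the monitoring setup and the real instance counts in the salt pillars, not in code like this.

```python
def sustained_overload(samples, threshold, min_duration_s):
    """Check for a continuous overload period.

    samples: list of (timestamp_seconds, load15) tuples sorted by time.
    Returns True if load15 stays above threshold for at least
    min_duration_s seconds without dropping below it in between.
    """
    start = None
    for timestamp, load15 in samples:
        if load15 > threshold:
            if start is None:
                start = timestamp  # overload period begins here
            if timestamp - start >= min_duration_s:
                return True
        else:
            start = None  # load recovered, reset the period
    return False


def proposed_instances(current, overloaded, minimum=1):
    """Reduce the instance count by one if the host was overloaded."""
    return max(minimum, current - 1) if overloaded else current
```

Stepping down by one instance at a time and then returning to monitoring keeps AC2 in view: the count only drops as far as needed to stop the alerts.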
