coordination #158110

open

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

[epic] Prevent worker overload

Added by okurz about 1 month ago. Updated 7 days ago.

Status: New
Priority: Normal
Assignee: -
Category: Feature requests
Target version:
Start date: 2024-03-27
Due date: 2024-06-07 (Due in 41 days)
% Done: 50%
Estimated time: (Total: 0.00 h)

Description

Motivation

Overloaded openQA workers have long caused "typing issues" and other random problems that lead to annoying sporadic test failures. The usual way to avoid this is to carefully configure openQA worker machines so that they are not overloaded, normally by running only a limited number of worker instances, sized for the case where all worker instances work on rather resource-heavy jobs at the same time. As a consequence, openQA hardware can in most cases be severely underused. To use hardware resources more efficiently while keeping openQA jobs as stable as possible, openQA itself must ensure that resources are not exhausted.

Ideas

  • Only pick up (or start) new jobs if CPU load is below configured threshold -> #158125
  • Overload openQA systems on purpose to find out which system parameters are critical, then define a corresponding feature request for each relevant system parameter so that openQA handles it automatically, e.g. check CPU load, available memory, I/O rate, storage space, etc.
    • before starting jobs
    • while running jobs
    • after jobs failed
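
As a rough illustration of the first idea, the following Python sketch gates job pickup on the 1-minute load average and on free storage space. It is not how openQA implements this (see #158125 for the actual work); the names `Limits`, `overload_reasons` and `may_accept_job`, the default values and the checked path are all hypothetical and chosen only for this example.

```python
import os
import shutil
from dataclasses import dataclass


@dataclass
class Limits:
    # Hypothetical settings; names and defaults are illustrative only.
    max_load: float = 40.0               # ceiling for the 1-minute load average
    min_free_bytes: int = 20 * 1024**3   # free space required on the checked path


def overload_reasons(load1, free_bytes, limits):
    """Return the reasons (possibly none) why no new job should be started."""
    reasons = []
    if load1 >= limits.max_load:
        reasons.append(f"load {load1:.2f} >= {limits.max_load:.2f}")
    if free_bytes < limits.min_free_bytes:
        reasons.append(f"free space {free_bytes} < {limits.min_free_bytes}")
    return reasons


def may_accept_job(limits, path="/"):
    """Sample the host once and decide whether a new job may be picked up."""
    load1, _, _ = os.getloadavg()              # available on Unix-like systems only
    free_bytes = shutil.disk_usage(path).free  # bytes free on the filesystem of `path`
    return not overload_reasons(load1, free_bytes, limits)
```

Keeping the decision in a pure function (`overload_reasons`) separate from the sampling makes it easy to add further parameters such as available memory or I/O rate, and to run the same check before starting jobs, while jobs run, or after a job failed.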

Subtasks 6 (3 open, 3 closed)

openQA Infrastructure - action #158104: typing issue on ppc64 worker size:S | Resolved | okurz | 2024-03-27
openQA Infrastructure - action #158113: typing issue on ppc64 worker - make CPU load alert more strict size:M | Resolved | okurz | 2024-03-27
openQA Infrastructure - action #158116: typing issue on ppc64 worker - crosscheck performance impact of ffmpeg on ppc64le (Power8 kvm) size:M | Workable | 2024-03-27
action #158125: typing issue on ppc64 worker - only pick up (or start) new jobs if CPU load is below configured threshold size:M | Resolved | mkittler
openQA Infrastructure - action #158709: typing issue on ppc64 worker - with automatic CPU load based limiting in place let's increase the instances on mania again | New
action #158910: typing issue on ppc64 worker - reconsider number of worker instances in particular on ppc64le kvm tests size:M | Feedback | okurz | 2024-06-07
Actions #1

Updated by okurz about 1 month ago

  • Subtask #158113 added
Actions #2

Updated by okurz about 1 month ago

  • Subtask #158104 added
Actions #3

Updated by okurz about 1 month ago

  • Subtask #158116 added
Actions #4

Updated by okurz about 1 month ago

  • Subtask #158125 added
Actions #5

Updated by okurz 18 days ago

  • Subtask #158709 added
Actions #6

Updated by okurz 18 days ago

  • Description updated
Actions #7

Updated by okurz 15 days ago

  • Subtask #158910 added