coordination #110833

open

[saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

Added by okurz over 1 year ago. Updated 12 days ago.

Status: New
Priority: Normal
Assignee: -
Category: Feature requests
Target version:
Start date: 2022-05-09
Due date:
% Done: 83%
Estimated time: (Total: 0.00 h)
Difficulty:
Description

Motivation

Ideas

  • Test locally by scheduling something like 100k jobs and see how the scheduler scales
  • Test locally by scheduling many jobs on something like 1k worker instances and see how the scheduler scales
  • Note that there's a unit test for scalability which one might simply invoke with very high numbers for scheduled jobs and available workers
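
To get a first feeling for the orders of magnitude before running against a real openQA instance, a toy model of a single scheduler tick can be timed with growing numbers of jobs and workers. The following is only a minimal sketch and not openQA's scheduler: the functions `make_workload`, `assign_once` and `time_tick`, the worker class names and the matching rule (oldest job of the same class first) are all assumptions made up for illustration.

```python
import random
import time
from collections import defaultdict, deque


def make_workload(num_jobs, num_workers,
                  classes=("qemu_x86_64", "qemu_ppc64le", "s390x"), seed=0):
    """Build a synthetic job queue and worker pool (illustrative names only)."""
    rng = random.Random(seed)
    jobs = [(job_id, rng.choice(classes)) for job_id in range(num_jobs)]
    workers = [(worker_id, rng.choice(classes)) for worker_id in range(num_workers)]
    return jobs, workers


def assign_once(jobs, workers):
    """One toy scheduler tick: each idle worker gets the oldest job of its class."""
    queues = defaultdict(deque)
    for job_id, worker_class in jobs:
        queues[worker_class].append(job_id)
    assignments = {}
    for worker_id, worker_class in workers:
        if queues[worker_class]:
            assignments[worker_id] = queues[worker_class].popleft()
    return assignments


def time_tick(num_jobs, num_workers):
    """Return (elapsed seconds, number of assigned jobs) for one tick."""
    jobs, workers = make_workload(num_jobs, num_workers)
    start = time.perf_counter()
    assignments = assign_once(jobs, workers)
    return time.perf_counter() - start, len(assignments)


if __name__ == "__main__":
    # Scale up towards the numbers named in the ticket title.
    for num_jobs, num_workers in [(1_000, 10), (10_000, 100), (100_000, 1_000)]:
        elapsed, assigned = time_tick(num_jobs, num_workers)
        print(f"{num_jobs:>7} jobs, {num_workers:>5} workers: "
              f"{assigned} assigned in {elapsed:.4f}s")
```

Such a model says nothing about database, websocket or HTTP overhead, which is where the incidents tracked in the subtasks actually showed up, so it can only complement the local tests and the existing scalability unit test mentioned above.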

Subtasks 26 (5 open, 21 closed)

  • coordination #108209: [epic] Reduce load on OSD (Blocked, okurz, 2023-04-01)
  • openQA Infrastructure - action #128789: [alert] Apache Response Time alert size:M (Resolved, nicksinger, 2023-04-01)
  • action #129481: Try to *reduce* number of apache workers to limit concurrent requests causing high CPU usage (New)
  • openQA Infrastructure - action #129484: high response times on osd - Move OSD workers to o3 to prevent OSD overload size:M (Resolved, okurz, 2023-05-17)
  • action #129487: high response times on osd - Limit the number of concurrent job upload handling on webUI side. Can we use a semaphore or lock using the database? size:M (Rejected, okurz)
  • action #129490: high response times on osd - Try nginx on o3 with enabled load limiting or load balancing features (Resolved, kraih)
  • openQA Infrastructure - action #129493: high response times on osd - better nice level for velociraptor (Resolved, okurz)
  • action #129619: high response times on osd - simple limit of jobs running concurrently in openQA size:M (Resolved, tinita, 2023-05-20)
  • action #129745: Enable apache response time alert and apache log alert again after we think it's good now size:M (Resolved, okurz, 2023-05-23)
  • action #130477: [O3]http connection to O3 repo is broken sporadically in virtualization tests, likely due to systemd dependencies on apache/nginx size:M (Resolved, mkittler, 2023-06-07)
  • action #130636: high response times on osd - Try nginx on osd with enabled load limiting or load balancing features (New)
  • action #131024: Ensure both nginx+apache are properly covered in packages+testing+documentation size:S (Resolved, dheidler)
  • openQA Infrastructure - action #133325: osd http response alerts - bump threshold further up (Rejected, okurz, 2023-07-25)
  • openQA Infrastructure - action #133397: HTTP Response alert Salt alerting and autoresolving shortly size:M (Resolved, mkittler, 2023-07-26)
  • action #134114: Ensure to call OpenQA::Setup::read_config in unit tests (Resolved, tinita)
  • action #110785: OSD incident 2022-05-09: Many scheduled jobs not picked up despite idle workers, blocked by one worker instance that should be broken? (Resolved, mkittler, 2022-05-09)
  • coordination #135122: [epic] OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert (Resolved, okurz, 2023-09-07)
  • openQA Infrastructure - action #135329: s390x work demand exceeds available workers (Resolved, okurz, 2023-09-07)
  • action #135362: Optimize worker status update handling in websocket server size:M (Resolved, kraih, 2023-09-07)
  • openQA Infrastructure - action #135380: A significant number of scheduled jobs with one or two running triggers an alert (Resolved, okurz, 2023-09-07)
  • action #135407: [tools] Measure to mitigate websockets overload by workers and revert it size:M (Resolved, livdywan, 2023-09-08)
  • action #135482: Move to systemd journal only on o3+osd (was: Missing openqa_websockets log file on OSD for websocket server) size:M (Rejected, okurz, 2023-09-11)
  • openQA Infrastructure - action #135578: Long job age and jobs not executed for long size:M (Resolved, nicksinger)
  • coordination #139010: [epic] Long OSD ppc64le job queue (New, 2023-11-04)
  • openQA Infrastructure - action #139100: Long OSD ppc64le job queue - Move nue3 power8 machines to nue2 (Resolved, okurz, 2023-11-04)
  • openQA Infrastructure - action #139103: Long OSD ppc64le job queue - Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs size:M (Blocked, okurz, 2023-11-04)

Related issues 1 (0 open, 1 closed)

  • Copied from openQA Project - coordination #64746: [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results (Resolved, okurz, 2020-03-18)

#1

Updated by okurz over 1 year ago

  • Copied from coordination #64746: [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results added

#2

Updated by okurz over 1 year ago

  • Related to action #110785: OSD incident 2022-05-09: Many scheduled jobs not picked up despite idle workers, blocked by one worker instance that should be broken? added

#3

Updated by okurz 3 months ago

  • Subtask #135122 added

#4

Updated by okurz about 1 month ago

  • Subtask #139010 added