coordination #110833

open

[saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

Added by okurz almost 2 years ago. Updated 1 day ago.

Status: New
Priority: Normal
Assignee: -
Category: Feature requests
Target version:
Start date: 2022-05-09
Due date: 2024-06-07 (Due in 49 days)
% Done: 52%
Estimated time: (Total: 0.00 h)

Description

Motivation

Ideas

  • Test locally by scheduling something like 100k jobs and see how the scheduler scales
  • Test locally by scheduling many jobs on something like 1k worker instances and see how the scheduler scales
  • There is already a unit test for scalability, which could simply be invoked with very high numbers of scheduled jobs and available workers
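
As a rough starting point for the local experiments listed above, the following toy Python sketch times a greedy job-to-worker matching for a growing number of scheduled jobs against 1k worker instances. It is not openQA's scheduler and it does not use the existing scalability unit test; the functions `assign_jobs` and `benchmark` and the worker class names are hypothetical and chosen only for illustration.

```python
import time
from collections import defaultdict, deque


def assign_jobs(jobs, workers):
    """Greedily assign each job to a free worker of the same worker class.

    jobs and workers are lists of (id, worker_class) tuples.
    Returns a dict mapping job id to worker id for every job that got a worker.
    """
    free = defaultdict(deque)
    for worker_id, worker_class in workers:
        free[worker_class].append(worker_id)

    assignments = {}
    for job_id, worker_class in jobs:
        queue = free.get(worker_class)
        if queue:  # a free worker of the right class exists
            assignments[job_id] = queue.popleft()
    return assignments


def benchmark(num_jobs, num_workers,
              classes=("class_a", "class_b", "class_c")):  # hypothetical classes
    """Build synthetic jobs and workers and time one assignment pass."""
    jobs = [(i, classes[i % len(classes)]) for i in range(num_jobs)]
    workers = [(i, classes[i % len(classes)]) for i in range(num_workers)]
    start = time.perf_counter()
    assignments = assign_jobs(jobs, workers)
    elapsed = time.perf_counter() - start
    return len(assignments), elapsed


if __name__ == "__main__":
    for num_jobs in (1_000, 10_000, 100_000):
        assigned, elapsed = benchmark(num_jobs, 1_000)
        print(f"{num_jobs} jobs, 1000 workers: "
              f"{assigned} assigned in {elapsed:.4f}s")
```

Real numbers would have to come from the actual scheduler, since this sketch ignores the database, websocket traffic, priorities and job dependencies.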

Subtasks 45 (19 open, 26 closed)

coordination #108209: [epic] Reduce load on OSD | Blocked | okurz | 2023-04-01
openQA Infrastructure - action #128789: [alert] Apache Response Time alert size:M | Resolved | nicksinger | 2023-04-01
action #129481: Try to *reduce* number of apache workers to limit concurrent requests causing high CPU usage | New
openQA Infrastructure - action #129484: high response times on osd - Move OSD workers to o3 to prevent OSD overload size:M | Resolved | okurz | 2023-05-17
action #129487: high response times on osd - Limit the number of concurrent job upload handling on webUI side. Can we use a semaphore or lock using the database? size:M | Rejected | okurz
action #129490: high response times on osd - Try nginx on o3 with enabled load limiting or load balancing features | Resolved | kraih
openQA Infrastructure - action #129493: high response times on osd - better nice level for velociraptor | Resolved | okurz
action #129619: high response times on osd - simple limit of jobs running concurrently in openQA size:M | Resolved | tinita | 2023-05-20
action #129745: Enable apache response time alert and apache log alert again after we think it's good now size:M | Resolved | okurz | 2023-05-23
action #130477: [O3]http connection to O3 repo is broken sporadically in virtualization tests, likely due to systemd dependencies on apache/nginx size:M | Resolved | mkittler | 2023-06-07
action #130636: high response times on osd - Try nginx on osd with enabled load limiting or load balancing features | New
action #131024: Ensure both nginx+apache are properly covered in packages+testing+documentation size:S | Resolved | dheidler
openQA Infrastructure - action #133325: osd http response alerts - bump threshold further up | Rejected | okurz | 2023-07-25
openQA Infrastructure - action #133397: HTTP Response alert Salt alerting and autoresolving shortly size:M | Resolved | mkittler | 2023-07-26
action #134114: Ensure to call OpenQA::Setup::read_config in unit tests | Resolved | tinita
openQA Infrastructure - action #157081: OSD unresponsive or significantly slow for some minutes 2024-03-12 08:30Z | Resolved | okurz | 2024-03-12
openQA Infrastructure - action #157666: OSD unresponsive and then not starting any more jobs on 2024-03-21 | Resolved | okurz | 2024-03-12
openQA Infrastructure - action #157726: osd-deployment | Failed pipeline for master (worker3[6-9].oqa.prg2.suse.org) | Blocked | okurz | 2024-03-18
openQA Infrastructure - action #158059: OSD unresponsive or significantly slow for some minutes 2024-03-26 13:34Z | Resolved | okurz
action #110785: OSD incident 2022-05-09: Many scheduled jobs not picked up despite idle workers, blocked by one worker instance that should be broken? | Resolved | mkittler | 2022-05-09
coordination #135122: [epic] OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert | Resolved | okurz | 2023-09-07
openQA Infrastructure - action #135329: s390x work demand exceeds available workers | Resolved | okurz | 2023-09-07
action #135362: Optimize worker status update handling in websocket server size:M | Resolved | kraih | 2023-09-07
openQA Infrastructure - action #135380: A significant number of scheduled jobs with one or two running triggers an alert | Resolved | okurz | 2023-09-07
action #135407: [tools] Measure to mitigate websockets overload by workers and revert it size:M | Resolved | livdywan | 2023-09-08
action #135482: Move to systemd journal only on o3+osd (was: Missing openqa_websockets log file on OSD for websocket server) size:M | Rejected | okurz | 2023-09-11
openQA Infrastructure - action #135578: Long job age and jobs not executed for long size:M | Resolved | nicksinger
coordination #139010: [epic] Long OSD ppc64le job queue | New | 2023-11-04
openQA Infrastructure - action #139100: Long OSD ppc64le job queue - Move nue3 power8 machines to nue2 | Resolved | okurz | 2023-11-04
openQA Infrastructure - action #139103: Long OSD ppc64le job queue - Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs size:M | Blocked | okurz | 2023-11-04
coordination #157669: websockets+scheduler improvements | New | 2023-08-31
action #134924: Websocket server overloaded, affected worker slots shown as "broken" with graceful disconnect in workers table | New | 2023-08-31
action #157675: Optimize openqa-scheduler database queries, e.g. "SELECT value FROM worker_properties..." | New | 2024-03-21
action #157681: Profiling using NYTProf for openqa-websockets and openqa-scheduler | New | 2024-03-21
action #157684: cycle execution health check in openqa-scheduler | New | 2024-03-21
action #157690: Simple global limit of registered/online workers | New | 2024-03-21
coordination #158110: [epic] Prevent worker overload | New | 2024-03-27 | 2024-06-07
openQA Infrastructure - action #158104: typing issue on ppc64 worker size:S | Resolved | okurz | 2024-03-27
openQA Infrastructure - action #158113: typing issue on ppc64 worker - make CPU load alert more strict size:M | Resolved | okurz | 2024-03-27
openQA Infrastructure - action #158116: typing issue on ppc64 worker - crosscheck performance impact of ffmpeg on ppc64le (Power8 kvm) size:M | Workable | 2024-03-27
action #158125: typing issue on ppc64 worker - only pick up (or start) new jobs if CPU load is below configured threshold size:M | Feedback | mkittler | 2024-04-19
openQA Infrastructure - action #158709: typing issue on ppc64 worker - with automatic CPU load based limiting in place let's increase the instances on mania again | New
action #158910: typing issue on ppc64 worker - reconsider number of worker instances in particular on ppc64le kvm tests | Feedback | okurz | 2024-06-07
coordination #158167: [epic] Increase worker capacity | New | okurz | 2024-03-27
openQA Infrastructure - action #158170: Increase resources for s390x kvm size:M | Feedback | nicksinger | 2024-03-27

Related issues 1 (0 open, 1 closed)

Copied from openQA Project - coordination #64746: [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results | Resolved | okurz | 2020-03-18
#1

Updated by okurz almost 2 years ago

  • Copied from coordination #64746: [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results added

#2

Updated by okurz almost 2 years ago

  • Related to action #110785: OSD incident 2022-05-09: Many scheduled jobs not picked up despite idle workers, blocked by one worker instance that should be broken? added

#3

Updated by okurz 7 months ago

  • Subtask #135122 added

#4

Updated by okurz 6 months ago

  • Subtask #139010 added

#5

Updated by okurz 29 days ago

  • Subtask #157669 added

#6

Updated by okurz 29 days ago

  • Subtask #134924 added

#7

Updated by okurz 29 days ago

  • Subtask deleted (#134924)

#8

Updated by okurz 23 days ago

  • Subtask #158110 added

#9

Updated by okurz 23 days ago

  • Subtask #158167 added