coordination #110833

open

[saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

Added by okurz over 2 years ago. Updated about 2 months ago.

Status: New
Priority: Normal
Assignee: -
Category: Feature requests
Target version:
Start date: 2022-05-09
Due date:
% Done: 80%
Estimated time: (Total: 0.00 h)

Description

Motivation

Ideas

  • Test locally by scheduling something like 100k jobs and see how the scheduler scales
  • Test locally by scheduling many jobs on something like 1k worker instances and see how the scheduler scales
  • Note that there's a unit test for scalability which one might simply invoke with very high numbers for scheduled jobs and available workers
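The first two ideas can be tried out in miniature before touching a real openQA instance. The following is only a toy sketch, not openQA's scheduler: the function names `assign_jobs` and `measure`, the worker class names and the matching rule (oldest job first, one job per free worker of a matching class) are all assumptions made for illustration. It only shows the shape of such a measurement, i.e. timing one scheduling pass while job and worker counts grow.

```python
import time
from collections import deque


def assign_jobs(jobs, workers):
    """Toy scheduling pass: each free worker gets the oldest job of its class.

    jobs and workers are lists of (id, worker_class) tuples. This is a
    hypothetical stand-in, not the algorithm used by openqa-scheduler.
    """
    queues = {}
    for job_id, worker_class in jobs:
        queues.setdefault(worker_class, deque()).append(job_id)
    assignments = {}
    for worker_id, worker_class in workers:
        queue = queues.get(worker_class)
        if queue:
            assignments[worker_id] = queue.popleft()
    return assignments


def measure(job_count, worker_count,
            classes=("class_a", "class_b", "class_c")):  # hypothetical classes
    """Build synthetic jobs and workers, then time a single scheduling pass."""
    jobs = [(i, classes[i % len(classes)]) for i in range(job_count)]
    workers = [(i, classes[i % len(classes)]) for i in range(worker_count)]
    start = time.perf_counter()
    assignments = assign_jobs(jobs, workers)
    return len(assignments), time.perf_counter() - start


if __name__ == "__main__":
    for job_count, worker_count in [(1_000, 10), (10_000, 100), (100_000, 1_000)]:
        assigned, seconds = measure(job_count, worker_count)
        print(f"{job_count:>7} jobs, {worker_count:>5} workers: "
              f"{assigned} assigned in {seconds:.4f}s")
```

Timings from such a toy say nothing about the real scheduler, which also involves the database and the websocket server; the point is only that the same loop over growing job and worker counts can be applied to a local openQA setup or to the existing scalability unit test.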

Subtasks 60 (13 open, 47 closed)

coordination #108209: [epic] Reduce load on OSD (Resolved, okurz, 2023-04-01)
openQA Infrastructure (public) - action #128789: [alert] Apache Response Time alert size:M (Resolved, nicksinger, 2023-04-01)
action #129481: Try to *reduce* number of apache workers to limit concurrent requests causing high CPU usage (Rejected, okurz)
openQA Infrastructure (public) - action #129484: high response times on osd - Move OSD workers to o3 to prevent OSD overload size:M (Resolved, okurz, 2023-05-17)
action #129487: high response times on osd - Limit the number of concurrent job upload handling on webUI side. Can we use a semaphore or lock using the database? size:M (Rejected, okurz)
action #129490: high response times on osd - Try nginx on o3 with enabled load limiting or load balancing features (Resolved, kraih)
openQA Infrastructure (public) - action #129493: high response times on osd - better nice level for velociraptor (Resolved, okurz)
action #129619: high response times on osd - simple limit of jobs running concurrently in openQA size:M (Resolved, tinita, 2023-05-20)
action #129745: Enable apache response time alert and apache log alert again after we think it's good now size:M (Resolved, okurz, 2023-05-23)
action #130477: [O3]http connection to O3 repo is broken sporadically in virtualization tests, likely due to systemd dependencies on apache/nginx size:M (Resolved, mkittler, 2023-06-07)
action #130636: high response times on osd - Try nginx on OSD size:S (Resolved, mkittler)
action #131024: Ensure both nginx+apache are properly covered in packages+testing+documentation size:S (Resolved, dheidler)
openQA Infrastructure (public) - action #133325: osd http response alerts - bump threshold further up (Rejected, okurz, 2023-07-25)
openQA Infrastructure (public) - action #133397: HTTP Response alert Salt alerting and autoresolving shortly size:M (Resolved, mkittler, 2023-07-26)
action #134114: Ensure to call OpenQA::Setup::read_config in unit tests (Resolved, tinita)
openQA Infrastructure (public) - action #157081: OSD unresponsive or significantly slow for some minutes 2024-03-12 08:30Z (Resolved, okurz, 2024-03-12)
openQA Infrastructure (public) - action #157666: OSD unresponsive and then not starting any more jobs on 2024-03-21 (Resolved, okurz, 2024-03-12)
openQA Infrastructure (public) - action #157726: osd-deployment | Failed pipeline for master (worker3[6-9].oqa.prg2.suse.org) (Resolved, okurz, 2024-03-18)
openQA Infrastructure (public) - action #158059: OSD unresponsive or significantly slow for some minutes 2024-03-26 13:34Z (Resolved, okurz)
openQA Infrastructure (public) - action #159396: Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s) size:M (Resolved, dheidler)
action #159651: high response times on osd - nginx with enabled rate limiting features size:S (Rejected, okurz, 2024-04-26)
action #159654: high response times on osd - nginx properly monitored in grafana size:S (Resolved, jbaier_cz, 2024-04-26)
openQA Infrastructure (public) - action #160239: [alert] External http responses Salt (https://openqa.suse.de/health) due to "Too many open files" after switch to nginx (Resolved, okurz, 2024-05-12)
openQA Infrastructure (public) - action #160367: After switch to nginx on OSD let's investigate how system performance was impacted (Resolved, okurz, 2024-05-14)
openQA Infrastructure (public) - action #160478: Try out higher global openQA job limit on OSD again after switch to nginx size:S (Resolved, okurz, 2023-08-31)
action #160877: [alert] Scripts CI pipeline failing due to osd yielding 502 size:M (Resolved, mkittler, 2024-05-24)
action #162533: [alert] OSD nginx yields 502 responses rather than being more resilient of e.g. openqa-webui restarts size:S (Resolved, mkittler, 2024-05-24)
action #162611: Easy local development setup for comparing apache2+nginx as openQA web proxy size:S (Resolved, livdywan, 2024-05-24)
action #162614: Consider other options how to restart openqa-webui to prevent 502's responses by nginx (Rejected, okurz, 2024-05-24)
action #110785: OSD incident 2022-05-09: Many scheduled jobs not picked up despite idle workers, blocked by one worker instance that should be broken? (Resolved, mkittler, 2022-05-09)
coordination #135122: [epic] OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert (Resolved, okurz, 2023-09-07)
openQA Infrastructure (public) - action #135329: s390x work demand exceeds available workers (Resolved, okurz, 2023-09-07)
action #135362: Optimize worker status update handling in websocket server size:M (Resolved, kraih, 2023-09-07)
openQA Infrastructure (public) - action #135380: A significant number of scheduled jobs with one or two running triggers an alert (Resolved, okurz, 2023-09-07)
action #135407: [tools] Measure to mitigate websockets overload by workers and revert it size:M (Resolved, livdywan, 2023-09-08)
action #135482: Move to systemd journal only on o3+osd (was: Missing openqa_websockets log file on OSD for websocket server) size:M (Rejected, okurz, 2023-09-11)
openQA Infrastructure (public) - action #135578: Long job age and jobs not executed for long size:M (Resolved, nicksinger)
coordination #139010: [epic] Long OSD ppc64le job queue (Blocked, okurz, 2023-11-04)
openQA Infrastructure (public) - action #139100: Long OSD ppc64le job queue - Move nue3 power8 machines to nue2 (Resolved, okurz, 2023-11-04)
openQA Infrastructure (public) - action #139103: Long OSD ppc64le job queue - Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs size:M (Resolved, okurz, 2023-11-04)
openQA Infrastructure (public) - action #166802: Recover worker37, worker38, worker39 size:S (Blocked, okurz)
coordination #157669: websockets+scheduler improvements to support more online worker instances (New, 2023-08-31)
action #134924: Websocket server overloaded, affected worker slots shown as "broken" with graceful disconnect in workers table (New, 2023-08-31)
action #157675: Optimize openqa-scheduler database queries, e.g. "SELECT value FROM worker_properties..." (New, 2024-03-21)
action #157681: Profiling using NYTProf for openqa-websockets and openqa-scheduler (New, 2024-03-21)
action #157684: cycle execution health check in openqa-scheduler (New, 2024-03-21)
action #157690: Simple global limit of registered/online workers size:M (Resolved, mkittler, 2024-03-21)
openQA Infrastructure (public) - action #167557: OSD not starting new jobs on 2024-09-28 due to >1k worker instances connected, overloading websocket server (Resolved, okurz, 2024-09-28)
action #168178: Limit connected online workers based on websocket+scheduler load size:M (Workable)
action #168502: Check for high websockets load on o3 2024-10-20 (Resolved, okurz, 2024-10-20)
coordination #158110: [epic] Prevent worker overload (New, 2024-03-27)
openQA Infrastructure (public) - action #158104: typing issue on ppc64 worker size:S (Resolved, okurz, 2024-03-27)
openQA Infrastructure (public) - action #158113: typing issue on ppc64 worker - make CPU load alert more strict size:M (Resolved, okurz, 2024-03-27)
openQA Infrastructure (public) - action #158116: typing issue on ppc64 worker - crosscheck performance impact of ffmpeg on ppc64le (Power8 kvm) size:M (Workable, 2024-03-27)
action #158125: typing issue on ppc64 worker - only pick up (or start) new jobs if CPU load is below configured threshold size:M (Resolved, mkittler)
openQA Infrastructure (public) - action #158709: typing issue on ppc64 worker - with automatic CPU load based limiting in place let's increase the instances on mania again (New)
action #158910: typing issue on ppc64 worker - reconsider number of worker instances in particular on ppc64le kvm tests size:M (Blocked, okurz)
action #168244: reconsider load calculation for worker load limit especially for ppc size:S (Resolved, okurz)
coordination #158167: [epic] Increase worker capacity (New, okurz, 2024-03-27)
openQA Infrastructure (public) - action #158170: Increase resources for s390x kvm size:M (Resolved, nicksinger, 2024-03-27)

Related issues 2 (1 open, 1 closed)

Copied from openQA Project (public) - coordination #64746: [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results (Resolved, okurz, 2020-03-18)
Copied to QA (public) - coordination #164466: [saga][epic] Scale up: Hyper-responsive openQA webUI (New, 2024-07-26, 2024-10-02)

#1

Updated by okurz over 2 years ago

  • Copied from coordination #64746: [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results added

#2

Updated by okurz over 2 years ago

  • Related to action #110785: OSD incident 2022-05-09: Many scheduled jobs not picked up despite idle workers, blocked by one worker instance that should be broken? added

#3

Updated by okurz over 1 year ago

  • Subtask #135122 added

#4

Updated by okurz about 1 year ago

  • Subtask #139010 added

#5

Updated by okurz 9 months ago

  • Subtask #157669 added

#6

Updated by okurz 9 months ago

  • Subtask #134924 added

#7

Updated by okurz 9 months ago

  • Subtask deleted (#134924)

#8

Updated by okurz 9 months ago

  • Subtask #158110 added

#9

Updated by okurz 9 months ago

  • Subtask #158167 added

#10

Updated by okurz 5 months ago

#11

Updated by okurz 3 months ago

  • Related to action #167557: OSD not starting new jobs on 2024-09-28 due to >1k worker instances connected, overloading websocket server added