coordination #108209

open

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

[epic] Reduce load on OSD

Added by okurz over 2 years ago. Updated about 14 hours ago.

Status:
Blocked
Priority:
High
Assignee:
Category:
Feature requests
Target version:
Start date:
2023-04-01
Due date:
2024-07-18 (Due in 6 days)
% Done:

85%

Estimated time:
(Total: 0.00 h)

Description

Motivation

See #107875

Ideas

  • Look into cumulative CPU usage to decide where to optimize first
  • Look up old ticket from kraih about reverse proxy for postgres -> #55262
  • Experiment with using nginx instead of apache
  • Log to remote target, e.g. apache logs, and only evaluate there
  • Use remote postgres database
  • Review other intervals in telegraf

Subtasks 28 (4 open, 24 closed)

openQA Infrastructure - action #128789: [alert] Apache Response Time alert size:M (Resolved, nicksinger, 2023-04-01)
action #129481: Try to *reduce* number of apache workers to limit concurrent requests causing high CPU usage (Rejected, okurz)
openQA Infrastructure - action #129484: high response times on osd - Move OSD workers to o3 to prevent OSD overload size:M (Resolved, okurz, 2023-05-17)
action #129487: high response times on osd - Limit the number of concurrent job upload handling on webUI side. Can we use a semaphore or lock using the database? size:M (Rejected, okurz)
action #129490: high response times on osd - Try nginx on o3 with enabled load limiting or load balancing features (Resolved, kraih)
openQA Infrastructure - action #129493: high response times on osd - better nice level for velociraptor (Resolved, okurz)
action #129619: high response times on osd - simple limit of jobs running concurrently in openQA size:M (Resolved, tinita, 2023-05-20)
action #129745: Enable apache response time alert and apache log alert again after we think it's good now size:M (Resolved, okurz, 2023-05-23)
action #130477: [O3]http connection to O3 repo is broken sporadically in virtualization tests, likely due to systemd dependencies on apache/nginx size:M (Resolved, mkittler, 2023-06-07)
action #130636: high response times on osd - Try nginx on OSD size:S (Resolved, mkittler, 2024-05-17)
action #131024: Ensure both nginx+apache are properly covered in packages+testing+documentation size:S (Resolved, dheidler)
openQA Infrastructure - action #133325: osd http response alerts - bump threshold further up (Rejected, okurz, 2023-07-25)
openQA Infrastructure - action #133397: HTTP Response alert Salt alerting and autoresolving shortly size:M (Resolved, mkittler, 2023-07-26)
action #134114: Ensure to call OpenQA::Setup::read_config in unit tests (Resolved, tinita)
openQA Infrastructure - action #157081: OSD unresponsive or significantly slow for some minutes 2024-03-12 08:30Z (Resolved, okurz, 2024-03-12)
openQA Infrastructure - action #157666: OSD unresponsive and then not starting any more jobs on 2024-03-21 (Resolved, okurz, 2024-03-12)
openQA Infrastructure - action #157726: osd-deployment | Failed pipeline for master (worker3[6-9].oqa.prg2.suse.org) (Blocked, okurz, 2024-03-18)
openQA Infrastructure - action #158059: OSD unresponsive or significantly slow for some minutes 2024-03-26 13:34Z (Resolved, okurz)
openQA Infrastructure - action #159396: Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s) size:M (Resolved, dheidler, 2024-07-16)
action #159651: high response times on osd - nginx with enabled rate limiting features size:S (Rejected, okurz, 2024-04-26 to 2024-06-14)
action #159654: high response times on osd - nginx properly monitored in grafana size:S (Resolved, jbaier_cz, 2024-04-26)
openQA Infrastructure - action #160239: [alert] External http responses Salt (https://openqa.suse.de/health) due to "Too many open files" after switch to nginx (Resolved, okurz, 2024-05-12)
openQA Infrastructure - action #160367: After switch to nginx on OSD let's investigate how system performance was impacted (Resolved, okurz, 2024-05-14)
openQA Infrastructure - action #160478: Try out higher global openQA job limit on OSD again after switch to nginx size:S (Resolved, okurz, 2023-08-31)
action #160877: [alert] Scripts CI pipeline failing due to osd yielding 502 size:M (Resolved, mkittler, 2024-05-24)
action #162533: [alert] OSD nginx yields 502 responses rather than being more resilient of e.g. openqa-webui restarts size:S (Blocked, okurz, 2024-05-24)
action #162611: Easy local development setup for comparing apache2+nginx as openQA web proxy size:S (Workable, livdywan, 2024-05-24 to 2024-07-18)
action #162614: Consider other options how to restart openqa-webui to prevent 502's responses by nginx (New, 2024-05-24)

Related issues 3 (1 open, 2 closed)

Related to openQA Project - coordination #135122: [epic] OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert (Resolved, okurz, 2023-09-07)
Copied from openQA Infrastructure - action #107875: [alert][osd] Apache Response Time alert size:M (Resolved, tinita, 2022-03-04 to 2022-03-24)
Copied to openQA Project - coordination #158167: [epic] Increase worker capacity (New, okurz, 2024-03-27)

Actions #1

Updated by okurz over 2 years ago

  • Copied from action #107875: [alert][osd] Apache Response Time alert size:M added
Actions #2

Updated by okurz over 2 years ago

  • Description updated (diff)
Actions #3

Updated by okurz about 1 year ago

  • Tracker changed from action to coordination
  • Project changed from openQA Infrastructure to openQA Project
  • Subject changed from Reduce load on OSD to [epic] Reduce load on OSD
  • Category set to Feature requests
Actions #4

Updated by okurz about 1 year ago

  • Parent task set to #110833
Actions #5

Updated by okurz about 1 year ago

  • Status changed from New to Blocked
  • Assignee set to okurz
  • Target version changed from future to Ready
Actions #6

Updated by okurz 10 months ago

  • Related to coordination #135122: [epic] OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert added
Actions #7

Updated by okurz 10 months ago

  • Target version changed from Ready to Tools - Next
Actions #8

Updated by okurz 8 months ago

  • Target version changed from Tools - Next to future
Actions #9

Updated by okurz 4 months ago

  • Subtask #157081 added
Actions #10

Updated by okurz 4 months ago

  • Subtask #157666 added
Actions #11

Updated by okurz 4 months ago

  • Subtask #157726 added
Actions #12

Updated by okurz 4 months ago

  • Subtask #158059 added
Actions #13

Updated by okurz 4 months ago

Actions #14

Updated by okurz 3 months ago

  • Subtask #159396 added
Actions #15

Updated by okurz 3 months ago

  • Subtask #159651 added
Actions #16

Updated by okurz 3 months ago

  • Subtask #159654 added
Actions #17

Updated by okurz about 2 months ago

  • Subtask #160239 added
Actions #18

Updated by okurz about 2 months ago

  • Subtask #160367 added
Actions #19

Updated by okurz about 2 months ago

  • Subtask #160478 added
Actions #20

Updated by okurz 22 days ago

  • Subtask #162533 added
Actions #21

Updated by okurz 22 days ago

  • Subtask #162614 added
Actions #22

Updated by okurz 22 days ago

  • Subtask #160877 added
Actions #23

Updated by okurz 22 days ago

  • Subtask #162611 added