action #129487

closed

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

coordination #108209: [epic] Reduce load on OSD

high response times on osd - Limit the number of concurrent job upload handling on webUI side. Can we use a semaphore or lock using the database? size:M

Added by okurz 12 months ago. Updated 9 months ago.

Status: Rejected
Priority: High
Assignee:
Category: Feature requests
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Motivation

OSD suffers from too many jobs uploading too much data in parallel (the data could be "module results" or logs, I don't know). To keep this from overloading the instance to the point where visitors of the webUI also run into unresponsiveness, we should limit the number of concurrent uploads. This must not invalidate job results or cause incomplete results; uploads should just be serialized and queued.

Acceptance criteria

  • AC1: Concurrent job upload handling can be limited based on a configuration setting
  • AC2: By default there is no limit (otherwise we would need "user feedback")
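
A minimal sketch of how AC1 and AC2 could fit together, written in Python purely for illustration (openQA itself is not written in Python). The section name `upload_limits` and the key `max_concurrent_job_uploads` are hypothetical and not existing openQA settings; the point is only that a missing setting means "no limit".

```python
import configparser


def max_concurrent_uploads(ini_text):
    """Return the configured upload limit, or None when no limit is set."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    # Hypothetical section and key names, chosen only for this sketch.
    raw = parser.get("upload_limits", "max_concurrent_job_uploads", fallback="")
    raw = raw.strip()
    if not raw:
        # AC2: nothing configured means uploads are not limited.
        return None
    limit = int(raw)
    if limit < 1:
        raise ValueError("max_concurrent_job_uploads must be at least 1")
    # AC1: a configured value caps concurrent upload handling.
    return limit
```

With an empty configuration the function returns `None`, so existing instances would keep their current behaviour unless an admin opts in.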

Suggestions

  • Marius assumes that Mojolicious does not support this, so it will be very hard. But can we use a semaphore, a lock or a "free upload ticket count" in the database?
  • The webUI could deny uploads if the "upload ticket count" is exceeded; the worker understands that response, backs off and waits
  • "upload ticket count" could for a start be "all jobs in the running and uploading states", but if necessary every job going to uploading could increment a counter
  • The worker code already has "back off and retry later" code, e.g. for when the webUI is rebooting or restarting, so we can just rely on that and have the webUI send back the same response, I dunno, a 503?

Further details

  • Keep the closely related #129619 in mind

Out of scope

  • Don't care about telling the worker something specific, assuming it would automatically retry uploading anyway, the same as when the webUI host reboots
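
For illustration, the retry behaviour assumed here could be sketched as below. This is not the actual openQA worker code; `upload_with_retry`, its parameters and the exponential delay are assumptions made for this example, and `send` stands for whatever performs one upload request and returns the HTTP status.

```python
import time


def upload_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns something other than 503.

    Waits base_delay, then twice that, and so on between attempts.
    Returns the last status seen.
    """
    status = 503
    for attempt in range(max_attempts):
        status = send()
        if status != 503:
            return status
        if attempt < max_attempts - 1:
            # Back off before the next try; the delay doubles each time.
            sleep(base_delay * 2 ** attempt)
    return status
```

Because the worker only reacts to the generic "try again later" status, the webUI would not need to tell it anything specific about upload limits.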

Related issues 4 (0 open, 4 closed)

Blocked by openQA Project - action #131024: Ensure both nginx+apache are properly covered in packages+testing+documentation size:S (Resolved, dheidler)

Copied from openQA Infrastructure - action #129484: high response times on osd - Move OSD workers to o3 to prevent OSD overload size:M (Resolved, okurz, 2023-05-17)

Copied to openQA Project - action #129490: high response times on osd - Try nginx on o3 with enabled load limiting or load balancing features (Resolved, kraih)

Copied to openQA Project - action #129619: high response times on osd - simple limit of jobs running concurrently in openQA size:M (Resolved, tinita, 2023-05-20)
