action #129487


coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

coordination #108209: [epic] Reduce load on OSD

high response times on osd - Limit the number of concurrent job upload handling on webUI side. Can we use a semaphore or lock using the database? size:M

Added by okurz 11 months ago. Updated 9 months ago.

Status: Rejected
Priority: High
Assignee:
Category: Feature requests
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Motivation

OSD suffers from too many jobs uploading too much stuff in parallel (stuff could be "module results" or logs, I don't know). To prevent such a situation from overloading the instance to the point that visitors of the webUI also run into unresponsiveness, we should limit the number of uploads, but without invalidating job results or causing incomplete results; just serialize and queue the uploading.

Acceptance criteria

  • AC1: Based on a configuration setting, concurrent job upload handling can be limited
  • AC2: By default there is no limit (otherwise we would need "user feedback")

Suggestions

  • Marius assumes that Mojolicious does not support that, so it will be very hard, but can we use a semaphore, a lock, or a "free upload ticket count" in the database? (see the sketch after this list)
  • The webUI could deny uploads if the "upload ticket count" is exceeded; the worker understands that response, backs off and waits
  • The "upload ticket count" could for a start be "all jobs in the running and uploading states", but if necessary every job going to uploading could increment a counter
  • The worker code already has "back off and retry later" code, e.g. for when the webUI is rebooting+restarting, so we can just rely on the same and the webUI sends back the same, I dunno, 503 response?
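
A minimal sketch of that "free upload ticket count" idea as a standalone Mojolicious::Lite app, only to illustrate the database-backed check and the 503 back-off response; the jobs table query, the limit of 50 and the connection string are assumptions for illustration, not existing openQA code:

use Mojolicious::Lite;
use Mojo::Pg;

# assumed connection string and limit, adjust for the real deployment
my $pg    = Mojo::Pg->new('postgresql://openqa@/openqa');
my $limit = 50;

post '/jobs/:jobid/artefact' => sub {
    my $c = shift;

    # "upload ticket count": here simply the number of jobs currently uploading
    my $busy = $pg->db->query(q{SELECT count(*) FROM jobs WHERE state = 'uploading'})->array->[0];

    if ($busy >= $limit) {
        # deny the upload; the worker is expected to back off and retry later
        $c->res->headers->header('Retry-After' => 10);
        return $c->render(json => {error => 'upload limit reached, retry later'}, status => 503);
    }

    # the real artefact handling would happen here
    $c->render(json => {ok => 1});
};

app->start;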

Further details

  • Keep very much related #129619 in mind

Out of scope

  • Don't care about telling the worker anything specific, assuming it would automatically retry uploading anyway, same as when the webUI host reboots

Related issues: 4 (0 open, 4 closed)

  • Blocked by openQA Project - action #131024: Ensure both nginx+apache are properly covered in packages+testing+documentation size:S (Resolved, dheidler)
  • Copied from openQA Infrastructure - action #129484: high response times on osd - Move OSD workers to o3 to prevent OSD overload size:M (Resolved, okurz, 2023-05-17)
  • Copied to openQA Project - action #129490: high response times on osd - Try nginx on o3 with enabled load limiting or load balancing features (Resolved, kraih)
  • Copied to openQA Project - action #129619: high response times on osd - simple limit of jobs running concurrently in openQA size:M (Resolved, tinita, 2023-05-20)

Actions #1

Updated by okurz 11 months ago

  • Copied from action #129484: high response times on osd - Move OSD workers to o3 to prevent OSD overload size:M added
Actions #2

Updated by okurz 11 months ago

  • Copied to action #129490: high response times on osd - Try nginx on o3 with enabled load limiting or load balancing features added
Actions #3

Updated by okurz 11 months ago

  • Copied to action #129619: high response times on osd - simple limit of jobs running concurrently in openQA size:M added
Actions #4

Updated by okurz 11 months ago

  • Subject changed from high response times on osd - Limit the number of concurrent job upload handling on webUI side. Can we use a semaphore or lock using the database? to high response times on osd - Limit the number of concurrent job upload handling on webUI side. Can we use a semaphore or lock using the database? size:M
  • Description updated (diff)
Actions #5

Updated by okurz 11 months ago

  • Status changed from New to Workable

It's already estimated; I just forgot to update the status.

Actions #6

Updated by kraih 11 months ago

This is indeed very much related to #129619 and can probably be solved with a few nginx connection limit settings too. The only real concern is that the asset upload retry code might not be good enough yet.

Actions #7

Updated by dheidler 11 months ago

  • Status changed from Workable to In Progress
  • Assignee set to dheidler
Actions #8

Updated by openqa_review 11 months ago

  • Due date set to 2023-06-14

Setting due date based on mean cycle time of SUSE QE Tools

Actions #9

Updated by dheidler 11 months ago

  • Status changed from In Progress to Workable
  • Assignee changed from dheidler to kraih

Keeping this on hold as long as Sebastian is working on #129487, as this might make this ticket obsolete.

Actions #10

Updated by kraih 10 months ago

This nginx config should work for rate limiting artefact uploads:

# shared memory zone for counting connections, keyed on the server name
limit_conn_zone $server_name zone=servers:10m;

upstream webui {
    zone upstream_webui 64k;
    # at most 30 connections to the openQA webUI backend (shared across nginx workers via the zone)
    server [::1]:9526 max_conns=30;
}

server {
    listen       8080;
    server_name  localhost;
    root /usr/share/openqa/public;
    client_max_body_size 0;

    location /api/v1/ws/ {
        proxy_pass http://[::1]:9527;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location /liveviewhandler/ {
        proxy_pass http://[::1]:9528;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    # Limit artefact uploads to 50 connections
    location ~ ^/jobs/[0-9]+/artefact$ {
        proxy_pass "http://webui";
        limit_conn servers 50;

        proxy_set_header X-Rate-Limit-Test "true";
        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Server $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    location / {
        proxy_pass "http://webui";
        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Server $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
Actions #11

Updated by kraih 10 months ago

  • Status changed from Workable to Blocked

With nginx deployment on OSD taking more time, I wonder if we should try rate limiting with a more hack-ish implementation in openQA itself after all. We have all the basic parts already via Minion locks. For the final/permanent production deployment I would strongly recommend nginx connection limits, though.
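
As a rough illustration of that idea (hypothetical code, not something that exists in openQA; lock name, limit and expiration are made up), a shared Minion lock could guard the artefact route and turn "no free slot" into a 503:

# sketch: guard artefact uploads behind a shared Minion lock ($r is the application router)
my $limited = $r->under(sub {
    my $c = shift;

    # up to 50 concurrent holders; the lock expires after 300s in case a request dies
    my $guard = $c->minion->guard('artefact_upload', 300, {limit => 50});
    unless ($guard) {
        $c->render(text => 'upload limit exceeded, retry later', status => 503);
        return undef;
    }

    # keep the guard in the stash so the lock is held until the request is done
    $c->stash(upload_guard => $guard);
    return 1;
});
$limited->post('/jobs/:jobid/artefact')->to('job#create_artefact');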

Actions #12

Updated by kraih 10 months ago

  • Status changed from Blocked to Workable
Actions #13

Updated by okurz 10 months ago

  • Due date deleted (2023-06-14)
  • Status changed from Workable to Blocked
  • Assignee changed from kraih to okurz

I agree that we should first await full results from #129490 before we look into this ticket. I assume you wanted to block on that?

Actions #14

Updated by kraih 10 months ago

okurz wrote:

I agree that we should first await full results from #129490 before we look into this ticket. I assume you wanted to block on that?

Yeah, and then I had an idea for a generic connection limit feature that could work at the Mojolicious routing level.

$job_r->limit_connections(50)->post('/artefact')->name('apiv1_create_artefact')->to('job#create_artefact');

It's actually pretty simple to implement and could be applied to any route with ->limit_connections(50). It would use a Minion shared lock in PostgreSQL and respond with a 503 if the limit is exceeded, basically emulating how nginx rate limits work in a less efficient way. So it would be useful to test the basic concept of limiting artefact uploads based on the number of connections, but we'd throw the code away again in the future.
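
A rough sketch of how registering such a shortcut could look (hypothetical code, not an actual implementation; it uses a fixed lock name for simplicity where a real version would derive it from the route):

# hypothetical implementation of the limit_connections shortcut
$app->routes->add_shortcut(limit_connections => sub {
    my ($r, $limit) = @_;

    # chain an intermediate route that acquires a shared Minion lock first
    return $r->under(sub {
        my $c     = shift;
        my $guard = $c->minion->guard('limited_connections', 300, {limit => $limit});
        unless ($guard) {
            $c->render(text => 'Service Unavailable', status => 503);
            return undef;
        }
        $c->stash(connection_guard => $guard);    # hold the lock for the rest of the request
        return 1;
    });
});

With that in place, the route definition shown above would work unchanged.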

Actions #15

Updated by okurz 10 months ago

Nice idea and it sounds promising, but as you said we would maybe throw the code away again, so I suggest waiting for the nginx story.

Actions #16

Updated by kraih 10 months ago

okurz wrote:

Nice idea and sounds promising but as you said we would maybe throw away the code again so that's why I suggest to wait for the nginx story.

It might be worth trying, because I still suspect that the whole concept will not work: rejecting already uploaded chunks of data will cause extra bandwidth to be used, or badly timed rejections could lead to periods where the available bandwidth is used less efficiently than it is now.

Actions #17

Updated by livdywan 9 months ago

okurz wrote:

I agree that we should first await full results from #129490 before we look into this ticket. I assume you wanted to block on that?

Clarifying for myself, but also commenting here to make it clear for others: the blocker has been resolved. There's #130636, which isn't on the backlog, but we're effectively waiting on #131024 now.

Actions #18

Updated by livdywan 9 months ago

  • Blocked by action #131024: Ensure both nginx+apache are properly covered in packages+testing+documentation size:S added
Actions #19

Updated by okurz 9 months ago

  • Status changed from Blocked to Rejected

By now #131024 has been resolved and nginx is pretty stable. With that, I don't think we should implement this more complicated solution within openQA itself right now.
