action #129487


coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

coordination #108209: [epic] Reduce load on OSD

high response times on osd - Limit the number of concurrent job upload handling on webUI side. Can we use a semaphore or lock using the database? size:M

Added by okurz 11 months ago. Updated 9 months ago.

Status: Rejected
Priority: High
Assignee:
Category: Feature requests
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Motivation

OSD suffers from too many jobs uploading too much stuff in parallel (stuff could be "module results" or logs, I don't know). To prevent such a situation from overloading the instance to the point that visitors of the webUI also run into unresponsiveness, we should limit the number of uploads, but without invalidating job results or causing incomplete results; just serialize and queue the uploading.

Acceptance criteria

  • AC1: Based on a configuration setting, concurrent job upload handling can be limited
  • AC2: By default there is no limit (otherwise we would need "user feedback")

Suggestions

  • Marius assumes that Mojolicious does not support that, so it will be very hard, but can we use a semaphore, a lock, or a "free upload ticket count" in the database? (see the sketch after this list)
  • The webUI could deny uploads if the "upload ticket count" is exceeded; the worker understands that response, backs off and waits
  • The "upload ticket count" could for a start be "all jobs in the running and uploading states", but if necessary every job going to uploading could increment a counter
  • The worker code already has "back off and retry later" code, e.g. for when the webUI is rebooting+restarting, so we can just rely on the same and the webUI sends back the same, I dunno, 503 response?
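
A minimal sketch of that "free upload ticket count" idea as a standalone Mojolicious::Lite app, only to illustrate the database-backed check and the 503 back-off response; the jobs table query, the limit of 50 and the connection string are assumptions for illustration, not existing openQA code:

use Mojolicious::Lite;
use Mojo::Pg;

# assumed connection string and limit, adjust for the real deployment
my $pg    = Mojo::Pg->new('postgresql://openqa@/openqa');
my $limit = 50;

post '/jobs/:jobid/artefact' => sub {
    my $c = shift;

    # "upload ticket count": here simply the number of jobs currently uploading
    my $busy = $pg->db->query(q{SELECT count(*) FROM jobs WHERE state = 'uploading'})->array->[0];

    if ($busy >= $limit) {
        # deny the upload; the worker is expected to back off and retry later
        $c->res->headers->header('Retry-After' => 10);
        return $c->render(json => {error => 'upload limit reached, retry later'}, status => 503);
    }

    # the real artefact handling would happen here
    $c->render(json => {ok => 1});
};

app->start;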

Further details

  • Keep very much related #129619 in mind

Out of scope

  • Don't care about telling the worker anything specific, assuming it would automatically retry uploading anyway, same as when the webUI host reboots

Related issues: 4 (0 open, 4 closed)

  • Blocked by openQA Project - action #131024: Ensure both nginx+apache are properly covered in packages+testing+documentation size:S (Resolved, dheidler)
  • Copied from openQA Infrastructure - action #129484: high response times on osd - Move OSD workers to o3 to prevent OSD overload size:M (Resolved, okurz, 2023-05-17)
  • Copied to openQA Project - action #129490: high response times on osd - Try nginx on o3 with enabled load limiting or load balancing features (Resolved, kraih)
  • Copied to openQA Project - action #129619: high response times on osd - simple limit of jobs running concurrently in openQA size:M (Resolved, tinita, 2023-05-20)

Actions #1

Updated by okurz 11 months ago

  • Copied from action #129484: high response times on osd - Move OSD workers to o3 to prevent OSD overload size:M added
Actions #2

Updated by okurz 11 months ago

  • Copied to action #129490: high response times on osd - Try nginx on o3 with enabled load limiting or load balancing features added
Actions #3

Updated by okurz 11 months ago

  • Copied to action #129619: high response times on osd - simple limit of jobs running concurrently in openQA size:M added
Actions #4

Updated by okurz 11 months ago

  • Subject changed from high response times on osd - Limit the number of concurrent job upload handling on webUI side. Can we use a semaphore or lock using the database? to high response times on osd - Limit the number of concurrent job upload handling on webUI side. Can we use a semaphore or lock using the database? size:M
  • Description updated (diff)
Actions #5

Updated by okurz 11 months ago

  • Status changed from New to Workable

It's already estimated; I just forgot to update the status.

Actions #6

Updated by kraih 11 months ago

This is indeed very much related to #129619 and can probably be solved with a few nginx connection limit settings too. The only real concern is that the asset upload retry code might not be good enough yet.

Actions #7

Updated by dheidler 11 months ago

  • Status changed from Workable to In Progress
  • Assignee set to dheidler
Actions #8

Updated by openqa_review 11 months ago

  • Due date set to 2023-06-14

Setting due date based on mean cycle time of SUSE QE Tools

Actions #9

Updated by dheidler 11 months ago

  • Status changed from In Progress to Workable
  • Assignee changed from dheidler to kraih

Keeping this on hold as long as Sebastian is working on #129487, as this might make this ticket obsolete.

Actions #10

Updated by kraih 10 months ago

This nginx config should work for rate limiting artefact uploads:

# shared memory zone for counting connections, keyed on the server name
limit_conn_zone $server_name zone=servers:10m;

upstream webui {
    zone upstream_webui 64k;
    # at most 30 connections to the openQA webUI backend (shared across nginx workers via the zone)
    server [::1]:9526 max_conns=30;
}

server {
    listen       8080;
    server_name  localhost;
    root /usr/share/openqa/public;
    client_max_body_size 0;

    location /api/v1/ws/ {
        proxy_pass http://[::1]:9527;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location /liveviewhandler/ {
        proxy_pass http://[::1]:9528;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    # Limit artefact uploads to 50 connections
    location ~ ^/jobs/[0-9]+/artefact$ {
        proxy_pass "http://webui";
        limit_conn servers 50;

        proxy_set_header X-Rate-Limit-Test "true";
        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Server $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    location / {
        proxy_pass "http://webui";
        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Server $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
Actions #11

Updated by kraih 10 months ago

  • Status changed from Workable to Blocked

With nginx deployment on OSD taking more time, I wonder if we should try rate limiting with a more hack-ish implementation in openQA itself after all. We have all the basic parts already via Minion locks. For the final/permanent production deployment I would strongly recommend nginx connection limits, though.
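
As a rough illustration of that idea (hypothetical code, not something that exists in openQA; lock name, limit and expiration are made up), a shared Minion lock could guard the artefact route and turn "no free slot" into a 503:

# sketch: guard artefact uploads behind a shared Minion lock ($r is the application router)
my $limited = $r->under(sub {
    my $c = shift;

    # up to 50 concurrent holders; the lock expires after 300s in case a request dies
    my $guard = $c->minion->guard('artefact_upload', 300, {limit => 50});
    unless ($guard) {
        $c->render(text => 'upload limit exceeded, retry later', status => 503);
        return undef;
    }

    # keep the guard in the stash so the lock is held until the request is done
    $c->stash(upload_guard => $guard);
    return 1;
});
$limited->post('/jobs/:jobid/artefact')->to('job#create_artefact');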

Actions #12

Updated by kraih 10 months ago

  • Status changed from Blocked to Workable
Actions #13

Updated by okurz 10 months ago

  • Due date deleted (2023-06-14)
  • Status changed from Workable to Blocked
  • Assignee changed from kraih to okurz

I agree that we should first await full results from #129490 before we look into this ticket. I assume you wanted to block on that?

Actions #14

Updated by kraih 10 months ago

okurz wrote:

I agree that we should first await full results from #129490 before we look into this ticket. I assume you wanted to block on that?

Yeah, and then I had an idea for a generic connection limit feature that could work at the Mojolicious routing level.

$job_r->limit_connections(50)->post('/artefact')->name('apiv1_create_artefact')->to('job#create_artefact');

It's actually pretty simple to implement and could be applied to any route with ->limit_connections(50). It would use a Minion shared lock in PostgreSQL and respond with a 503 if the limit is exceeded, basically emulating how nginx rate limits work in a less efficient way. So it would be useful to test the basic concept of limiting artefact uploads based on the number of connections, but we'd throw the code away again in the future.
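
A rough sketch of how registering such a shortcut could look (hypothetical code, not an actual implementation; it uses a fixed lock name for simplicity where a real version would derive it from the route):

# hypothetical implementation of the limit_connections shortcut
$app->routes->add_shortcut(limit_connections => sub {
    my ($r, $limit) = @_;

    # chain an intermediate route that acquires a shared Minion lock first
    return $r->under(sub {
        my $c     = shift;
        my $guard = $c->minion->guard('limited_connections', 300, {limit => $limit});
        unless ($guard) {
            $c->render(text => 'Service Unavailable', status => 503);
            return undef;
        }
        $c->stash(connection_guard => $guard);    # hold the lock for the rest of the request
        return 1;
    });
});

With that in place, the route definition shown above would work unchanged.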

Actions #15

Updated by okurz 10 months ago

Nice idea and it sounds promising, but as you said we would maybe throw the code away again, so I suggest waiting for the nginx story.

Actions #16

Updated by kraih 10 months ago

okurz wrote:

Nice idea and sounds promising but as you said we would maybe throw away the code again so that's why I suggest to wait for the nginx story.

It might be worth trying, because I still suspect that the whole concept will not work: rejecting already uploaded chunks of data will cause extra bandwidth to be used, or badly timed rejections could lead to periods where the available bandwidth is used less efficiently than it is now.

Actions #17

Updated by livdywan 9 months ago

okurz wrote:

I agree that we should first await full results from #129490 before we look into this ticket. I assume you wanted to block on that?

Clarifying for myself, but also commenting here to make it clear for others: the blocker has been resolved. There's #130636, which isn't on the backlog, but we're effectively waiting on #131024 now.

Actions #18

Updated by livdywan 9 months ago

  • Blocked by action #131024: Ensure both nginx+apache are properly covered in packages+testing+documentation size:S added
Actions #19

Updated by okurz 9 months ago

  • Status changed from Blocked to Rejected

By now #131024 has been resolved and nginx is pretty stable. With that, I don't think we should implement this more complicated solution within openQA itself right now.
