action #129619

closed

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

coordination #108209: [epic] Reduce load on OSD

high response times on osd - simple limit of jobs running concurrently in openQA size:M

Added by okurz about 1 year ago. Updated 9 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2023-05-20
Due date:
% Done: 0%
Estimated time:

Description

Motivation

OSD suffers from high response times or alerts about HTTP responses. As this is likely due to too many jobs trying to upload concurrently, we should introduce limits. Likely the easiest limit is on the number of jobs that the scheduler assigns to workers, to prevent too many running in parallel.

Acceptance criteria

  • AC1: openQA configuration options can limit the number of jobs that will be picked up at once
  • AC2: By default there is no limit

Suggestions

  • Look into the scheduler code, likely in lib/OpenQA/Scheduler/Model/Jobs.pm. Maybe it is possible to simply not assign any jobs to workers based on a config setting, if defined
  • Confirm in production, e.g. try it out on OSD
  • Come up with a good limit for osd

Further details

  • By default "no limit", because otherwise admins and users might be surprised if jobs are limited although they never configured anything

Out of scope

  • The type of workers or type of jobs doesn't matter. Of course jobs with 10k job modules are heavier, but here we focus solely on the number of jobs

Related issues 6 (0 open, 6 closed)

Related to openQA Project - action #129490: high response times on osd - Try nginx on o3 with enabled load limiting or load balancing features (Resolved, kraih)

Related to QA - action #130312: [tools] URL listing TW snapshots (and the changes therein), has stopped working (Resolved, kraih, 2023-06-03)

Related to openQA Infrastructure - action #134927: OSD throws 503, unresponsive for some minutes size:M (Resolved, okurz, 2023-08-31)

Related to openQA Infrastructure - action #135632: "Mojo::File::spurt is deprecated in favor of Mojo::File::spew" breaking os-autoinst OBS build and osd-deployment size:M (Resolved, okurz, 2023-05-08)

Blocks openQA Project - action #129481: Try to *reduce* number of apache workers to limit concurrent requests causing high CPU usage (Rejected, okurz)

Copied from openQA Project - action #129487: high response times on osd - Limit the number of concurrent job upload handling on webUI side. Can we use a semaphore or lock using the database? size:M (Rejected, okurz)

Actions #1

Updated by okurz about 1 year ago

  • Copied from action #129487: high response times on osd - Limit the number of concurrent job upload handling on webUI side. Can we use a semaphore or lock using the database? size:M added
Actions #2

Updated by okurz about 1 year ago

  • Subject changed from high response times on osd - simple limit of jobs running concurrently to high response times on osd - simple limit of jobs running concurrently size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by kraih about 1 year ago

Jan found an Nginx feature that would probably work for limiting concurrent uploads with the right location settings for the API endpoint. Log file upload retry should just work if rate limited; asset uploads are less clear, as there is a small possibility that they could get lost. So this needs to be tested.

Actions #4

Updated by kraih about 1 year ago

  • Assignee set to kraih
Actions #5

Updated by kraih about 1 year ago

  • Status changed from Workable to In Progress
Actions #6

Updated by kraih about 1 year ago

I'll do some local experiments to figure out the right nginx settings. For production deployment we'll have to consider other factors too though, such as logging with rotation and TLS certificates. So there will probably be a follow up ticket for proper nginx deployment, if this approach works out for the rate limiting.

Actions #7

Updated by openqa_review about 1 year ago

  • Due date set to 2023-06-14

Setting due date based on mean cycle time of SUSE QE Tools

Actions #8

Updated by livdywan about 1 year ago

  • Blocks action #129481: Try to *reduce* number of apache workers to limit concurrent requests causing high CPU usage added
Actions #9

Updated by kraih about 1 year ago

The nginx connection limits appear to work exactly as we expected and are pretty straightforward to configure:

# Here we create two connection counters, one for all incoming
# connections, and one per IP address
limit_conn_zone $server_name zone=all:10m;
limit_conn_zone $binary_remote_addr zone=addr:10m;

server {
    listen       8080;
    server_name  localhost;
    root /usr/share/openqa/public;
    client_max_body_size 0;

    location /api/v1/ws/ {
        proxy_pass http://[::1]:9527;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location /liveviewhandler/ {
        proxy_pass http://[::1]:9528;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location /tests {
        # Here we limit the actual number of connections
        limit_conn all 4;
        limit_conn addr 2;

        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Server $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass "http://[::1]:9526/tests";
    }

    location / {
        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Server $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass "http://[::1]:9526";
    }
}

For testing purposes I've only put limits on /tests: 4 connections overall from all IP addresses, and 2 connections per individual IP address. And this config does exactly that.
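
The effect of the two counters can be modelled in a few lines of Python. This is only a toy sketch of the admission logic, not how nginx implements it; the class name ConnLimiter and its methods are hypothetical. It mirrors "limit_conn all 4" and "limit_conn addr 2": a connection is admitted only if both the overall counter and the per-address counter are still below their limits.

```python
from collections import defaultdict


class ConnLimiter:
    """Toy model of two nginx-style connection counters (hypothetical helper)."""

    def __init__(self, max_all=4, max_per_addr=2):
        self.max_all = max_all            # corresponds to "limit_conn all 4"
        self.max_per_addr = max_per_addr  # corresponds to "limit_conn addr 2"
        self.total = 0
        self.per_addr = defaultdict(int)

    def acquire(self, addr):
        """Admit a connection only if both limits still have room."""
        if self.total >= self.max_all or self.per_addr[addr] >= self.max_per_addr:
            return False  # nginx would reject the request (503 by default)
        self.total += 1
        self.per_addr[addr] += 1
        return True

    def release(self, addr):
        """Free the slot again when a connection closes."""
        self.total -= 1
        self.per_addr[addr] -= 1


limiter = ConnLimiter()
clients = ["10.0.0.1", "10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.2", "10.0.0.3"]
results = [limiter.acquire(addr) for addr in clients]
print(results)
```

The third connection from 10.0.0.1 is refused by the per-address limit, and the connection from 10.0.0.3 is refused because four connections are already open overall.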

Actions #10

Updated by kraih about 1 year ago

And this one should do the trick for specifically rate limiting artefact uploads:

limit_conn_zone $server_name zone=servers:10m;

server {
    listen       8080;
    server_name  localhost;
    root /usr/share/openqa/public;
    client_max_body_size 0;

    location /api/v1/ws/ {
        proxy_pass http://[::1]:9527;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location /liveviewhandler/ {
        proxy_pass http://[::1]:9528;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location ~ ^/jobs/[0-9]+/artefact$ {
        limit_conn servers 50;

        proxy_set_header X-Rate-Limit-Test "true";
        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Server $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass "http://[::1]:9526";
    }

    location / {
        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Server $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass "http://[::1]:9526";
    }
}

With the rate limiting out of the way, now it is time to figure out all the other settings we would need for live deployment on O3/OSD.

Actions #11

Updated by kraih about 1 year ago

For starters I'm setting up Nginx on O3 in parallel to the current Apache2. Then we can do a switch over on a not so busy day to see if everything works. There's a whole lot of small workarounds in the Apache config, so there will probably be some minor breakage, but it shouldn't be anything that we can't fix quickly.

Actions #12

Updated by kraih about 1 year ago

Nginx is now running on O3 on port 8080 and should work as an Apache replacement. The only remaining question is how we go about testing it in production without causing too serious disruptions.

Actions #13

Updated by kraih about 1 year ago

I've set it up so starting and stopping the apache2 and nginx services should be enough to switch between the two reverse proxies.

$ cat /etc/nginx/conf.d/openqa.conf
server {
    listen      80;
    server_name openqa.opensuse.org openqa.infra.opensuse.org;

    root /usr/share/openqa/public;

    client_max_body_size 0;
    client_body_buffer_size 64k;
    client_header_buffer_size 4k;

    location /nginx_status {
        stub_status;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }

    location /assets {
        alias /var/lib/openqa/share/factory;
        tcp_nopush         on;
        sendfile           on;
        sendfile_max_chunk 1m;
    }

    location /image {
        alias /var/lib/openqa/images;
        tcp_nopush         on;
        sendfile           on;
        sendfile_max_chunk 1m;
    }

    location /api/v1/ws/ {
        proxy_pass http://127.0.0.1:9527;
        proxy_http_version 1.1;
        proxy_read_timeout 3600;
        proxy_send_timeout 3600;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location /liveviewhandler/ {
        proxy_pass http://127.0.0.1:9528;
        proxy_http_version 1.1;
        proxy_read_timeout 3600;
        proxy_send_timeout 3600;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location / {
        proxy_pass "http://127.0.0.1:9526";
        tcp_nodelay        on;
        proxy_read_timeout 900;
        proxy_send_timeout 900;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Server $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto "https";
    }

    access_log /space/logs/nginx/openqa.access_log;
    error_log /space/logs/nginx/openqa.error_log;
}

Edit: Some more tuning was required on O3. Enabling sendfile for asset downloads turned out to be extremely important. Without it the whole webui felt sluggish, because downloads were slowing everything else down.

Actions #14

Updated by kraih about 1 year ago

I've also set up logrotate based on the apache config.

/space/logs/nginx/openqa.access_log /space/logs/nginx/access_log {
    compress
    dateext
    delaycompress
    maxage 365
    rotate 10
    size=+4096k
    notifempty
    missingok
    create 644 root root
    sharedscripts
    postrotate
     systemctl reload nginx.service
     sleep 60
    endscript
}

/space/logs/nginx/openqa.error_log /space/logs/nginx/error_log {
    compress
    dateext
    delaycompress
    maxage 365
    rotate 10
    size=+4096k
    notifempty
    missingok
    create 644 root root
    sharedscripts
    postrotate
     systemctl reload nginx.service
     sleep 60
    endscript
}
Actions #15

Updated by okurz about 1 year ago

  • Related to action #129490: high response times on osd - Try nginx on o3 with enabled load limiting or load balancing features added
Actions #16

Updated by okurz about 1 year ago

  • Related to action #130312: [tools] URL listing TW snapshots (and the changes therein), has stopped working added
Actions #17

Updated by okurz about 1 year ago

Sounds great! Though I suggest to focus on nginx related work in #129490 and focus this ticket on a limit in pure openQA. In the end we might not want to continue this feature work in openQA if nginx helps us already to achieve the same as you have already found out

Actions #18

Updated by livdywan about 1 year ago

okurz wrote:

Sounds great! Though I suggest to focus on nginx related work in #129490 and focus this ticket on a limit in pure openQA. In the end we might not want to continue this feature work in openQA if nginx helps us already to achieve the same as you have already found out

That is exactly what we discussed and what Sebastian is doing here. It makes little sense to do the other way around.

Actions #19

Updated by okurz about 1 year ago

cdywan wrote:

That is exactly what we discussed and what Sebastian is doing here. It makes little sense to do the other way around.

I don't see that it makes little sense to do what this ticket suggests. But in any case please make the work explicit and transparent by using #129490 for nginx related work

Actions #20

Updated by livdywan about 1 year ago

okurz wrote:

cdywan wrote:

That is exactly what we discussed and what Sebastian is doing here. It makes little sense to do the other way around.

I don't see that it makes little sense to do what this ticket suggests. But in any case please make the work explicit and transparent by using #129490 for nginx related work

I was talking about looking into nginx first ;-) Not sure how the tickets got mixed up again. We also had some confusion due to #129487. It seems we need to take better care of tickets with overlapping goals.

Actions #21

Updated by livdywan about 1 year ago

  • Subject changed from high response times on osd - simple limit of jobs running concurrently size:M to high response times on osd - simple limit of jobs running concurrently in openQA size:M
  • Status changed from In Progress to Blocked

Let's swap the tickets accordingly. #129490 is the one to be evaluated first, hence blocking on that.

Actions #22

Updated by kraih about 1 year ago

okurz wrote:

In the end we might not want to continue this feature work in openQA if nginx helps us already to achieve the same as you have already found out

To sum up the situation so far: Nginx performs very well in production on O3 (with the right tuning), and it does have all the rate/connection limiting features we need for fine grained control over worker uploads. So it would make sense to rephrase the ACs of this ticket and to block it on #129490 as a prerequisite for connection limiting.

Actions #23

Updated by okurz about 1 year ago

kraih wrote:

So it would make sense to rephrase the ACs of this ticket and to block it on #129490 as a prerequisite for connection limiting.

I would rather put it like this: If we manage to use nginx to do everything we want to ensure that no openQA instance is overloaded, then we don't need this openQA feature at all. It might still be slightly beneficial in non-nginx scenarios, if there are any useful ones.

Actions #24

Updated by okurz about 1 year ago

  • Due date deleted (2023-06-14)

Discussed in weekly unblock 2023-06-07 and we will just wait for the nginx results before re-evaluating.

Actions #25

Updated by okurz 12 months ago

first #131024

Actions #26

Updated by livdywan 11 months ago

  • Subject changed from high response times on osd - simple limit of jobs running concurrently in openQA size:M to high response times on osd - simple limit of jobs running concurrently in openQA
  • Status changed from Blocked to New
  • Assignee deleted (kraih)

okurz wrote:

first #131024

The blocker was resolved. I'm resetting the ticket so we can re-evaluate what we want here

Actions #27

Updated by okurz 11 months ago

  • Subject changed from high response times on osd - simple limit of jobs running concurrently in openQA to high response times on osd - simple limit of jobs running concurrently in openQA size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #28

Updated by tinita 11 months ago

  • Status changed from Workable to In Progress
  • Assignee set to tinita
Actions #29

Updated by tinita 11 months ago

https://github.com/os-autoinst/openQA/pull/5276 Limit number of running jobs per webui instance

Actions #30

Updated by openqa_review 11 months ago

  • Due date set to 2023-08-23

Setting due date based on mean cycle time of SUSE QE Tools

Actions #31

Updated by tinita 11 months ago

https://github.com/os-autoinst/openQA/pull/5276 merged.
Another suggestion was to add a notice to the "All tests" page if the number of running jobs is currently at the limit, so instead of the normal "56 jobs are running" users would see something like "250 jobs are running (limit of running jobs reached)"

Actions #32

Updated by okurz 11 months ago

tinita wrote:

Another suggestion was to add a notice to the "All tests" page if the number of running jobs is currently at the limit, so instead of the normal "56 jobs are running" users would see something like "250 jobs are running (limit of running jobs reached)"

How about
"250 jobs are running (limited by server config)"
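
The proposed notice boils down to a small formatting rule. A minimal Python sketch, assuming a hypothetical helper running_jobs_notice and using None to mean "no limit configured" (the actual openQA templates and config representation may differ):

```python
def running_jobs_notice(running, max_running_jobs=None):
    """Return the summary text, adding a hint once the limit is reached.

    max_running_jobs=None stands for "no limit" in this sketch (an assumption).
    """
    text = f"{running} jobs are running"
    if max_running_jobs is not None and running >= max_running_jobs:
        text += " (limited by server config)"
    return text


print(running_jobs_notice(56, 250))
print(running_jobs_notice(250, 250))
```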

Actions #33

Updated by tinita 11 months ago

https://github.com/os-autoinst/openQA/pull/5279 Show max running jobs on /tests page

Actions #34

Updated by tinita 11 months ago

So I investigated the problem with max_running_jobs not always being set to the default value, causing warnings.
That is actually a problem of the tests, and it wouldn't happen in production code.
The problem is that apparently OpenQA::Setup::read_config is not always called when we create app instances in tests.

I think that's unfortunate because we have to cover for that in code by ensuring a default value in several places, although that should be done by OpenQA::Setup in just one place.
It would be good to make sure in tests we read the config and fill in defaults.

I will create a ticket because I had a quick look, and it looks to me like it's not fixable very fast: #134114

Actions #35

Updated by tinita 10 months ago

  • Status changed from In Progress to Feedback

I set max_running_jobs to 250 on osd and will monitor the job queue and CPU Load

Actions #36

Updated by tinita 10 months ago

  • Status changed from Feedback to In Progress

Somehow the limit had no effect. Will test locally...

Actions #37

Updated by tinita 10 months ago

  • Status changed from In Progress to Feedback

https://github.com/os-autoinst/openQA/pull/5285 Fix scheduler getting max_running_jobs config

So we have to use OpenQA::App->singleton->config. I had used $self->{config} because I found that code in the same module already.

This also explains some of the problems with the defaults.

I will look into that as part of #134114

Actions #38

Updated by tinita 10 months ago

https://github.com/os-autoinst/openQA/pull/5285 merged and deployed on osd, so limit should now be effective.

https://github.com/os-autoinst/openQA/pull/5287 Remove defaults, should be ensured by OpenQA::Setup already - merged, deployed on o3

Now monitoring what's happening

Actions #39

Updated by tinita 10 months ago

As discussed, the current implementation was too naive; I was just assuming that the schedule function would only get one job (or a group of jobs depending on each other) as a parameter.

What is happening now is the following:
https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=9&from=1692746389833&to=1692749719627

As soon as $running (the number of jobs in an EXECUTION state) is below the $limit, the schedule function will assign all currently scheduled jobs (up to a maximum of MAX_JOB_ALLOCATION, 80 by default, and limited by free workers of course).
Then it will no longer schedule until $running is below $limit again.

The current logic is not what was intended.
It is not that bad though, because the newly assigned jobs (let's say 80) are just starting, and then one job after the other will finish, while no new jobs are assigned. That means the number of jobs in the phase of uploading at the same time will not be that high.

We could say that together with the MAX_JOB_ALLOCATION limit we are probably fine. However, we might want max_running_jobs to be a hard limit (with the exception of parallel clusters that should be assigned together).
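
The difference between the current behaviour and a hard limit can be sketched in a few lines of Python. This is an illustrative model, not the openQA scheduler code: the function names allocate_naive and allocate_hard and their parameters are hypothetical, and the exception for parallel clusters is ignored.

```python
MAX_JOB_ALLOCATION = 80  # per-tick allocation cap mentioned above


def allocate_naive(running, limit, scheduled, free_workers):
    """Current behaviour: the limit only gates whether scheduling happens at all."""
    if running >= limit:
        return 0
    # Once below the limit, assign as many jobs as workers and the cap allow.
    return min(scheduled, free_workers, MAX_JOB_ALLOCATION)


def allocate_hard(running, limit, scheduled, free_workers):
    """Hard limit: running plus newly assigned jobs never exceeds the limit."""
    headroom = max(limit - running, 0)
    return min(scheduled, free_workers, MAX_JOB_ALLOCATION, headroom)


print(allocate_naive(249, 250, 500, 300))
print(allocate_hard(249, 250, 500, 300))
```

With 249 running jobs, a limit of 250, 500 scheduled jobs and 300 free workers, the naive variant assigns 80 jobs at once, while the hard variant assigns only 1.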

Actions #40

Updated by tinita 10 months ago

  • Status changed from Feedback to In Progress

We have now decided that we want max_running_jobs to be a hard limit. The current combination of max_running_jobs and MAX_JOB_ALLOCATION might actually be the best usage of resources, but it is harder to explain to users what exactly is happening. (MAX_JOB_ALLOCATION would also have to become a configuration variable to be easily adjustable, which would require a change as well.)

Actions #41

Updated by tinita 10 months ago

  • Due date changed from 2023-08-30 to 2023-09-06

Due to putting out fires and other important tasks - bumping due date

Actions #43

Updated by tinita 10 months ago

  • Status changed from In Progress to Workable

I'm unassigning myself.
No programming possible for the last two weeks, only putting fires out, and it seems it goes on.

Actions #44

Updated by tinita 10 months ago

I was wondering why the schedule method takes two additional parameters, as I can't find any code under lib that passes these:

sub schedule ($self, $allocated_workers = {}, $allocated_jobs = {}) {

I found https://github.com/os-autoinst/openQA/pull/3741/files and it looks like the parameters were added to be able to check the contents of those hashes (or actually only $allocated_workers) in a unit test.

Since I'm extracting _allocate_jobs this might not be necessary anymore, and I can remove those parameters.

Actions #45

Updated by tinita 10 months ago

  • Status changed from Workable to In Progress
Actions #46

Updated by tinita 10 months ago

  • Due date changed from 2023-09-06 to 2023-09-13
Actions #47

Updated by okurz 10 months ago

  • Related to action #134927: OSD throws 503, unresponsive for some minutes size:M added
Actions #48

Updated by tinita 10 months ago

Still working on the test.
One problem is that 04-scheduler.t so far only had one worker, so there are no tests yet that deal with more than one job assignment.

Actions #49

Updated by tinita 10 months ago

Ready for review: https://github.com/os-autoinst/openQA/pull/5289 Make max_running_jobs a hard limit

Actions #50

Updated by tinita 10 months ago

  • Status changed from In Progress to Feedback
Actions #51

Updated by okurz 9 months ago

PR approved. As mentioned in https://github.com/os-autoinst/openQA/pull/5289#issuecomment-1713488199 please look into the runtime improvement of the unit test and further refactoring of the code as discussed.

Actions #52

Updated by livdywan 9 months ago

okurz wrote in #note-51:

PR approved. As mentioned in https://github.com/os-autoinst/openQA/pull/5289#issuecomment-1713488199 please look into the runtime improvement of the unit test and further refactoring of the code as discussed.

I would lean towards a follow-up ticket here as this has already been in the queue for a bit and probably won't be done today... of course we can discuss this in the Unblock.

Actions #53

Updated by okurz 9 months ago

  • Status changed from Feedback to In Progress

livdywan wrote in #note-52:

okurz wrote in #note-51:

PR approved. As mentioned in https://github.com/os-autoinst/openQA/pull/5289#issuecomment-1713488199 please look into the runtime improvement of the unit test and further refactoring of the code as discussed.

I would lean towards a follow-up ticket here as this has already been in the queue for a bit and probably won't be done today... of course we can discuss this in the Unblock.

No, refactoring and cleanup with no user-facing benefit must be part of the original work. There is no better time for refactoring than now. Of course as a team we should strive to support each other as much as possible to bring this to conclusion as soon as possible.

Actions #54

Updated by okurz 9 months ago

  • Due date changed from 2023-09-13 to 2023-09-29
  • Priority changed from High to Normal

The feature is effective in production and we can follow up with the unit test improvements in the following days. Bumping the due date due to unforeseen distractions I needed to put on the shoulders of the team over the past days.

Actions #55

Updated by tinita 9 months ago

https://github.com/os-autoinst/openQA/pull/5306 scheduler: Log statistics of rejected jobs (#135578)

Actions #56

Updated by tinita 9 months ago

  • Status changed from In Progress to Feedback
Actions #57

Updated by okurz 9 months ago

  • Due date deleted (2023-09-29)
  • Status changed from Feedback to Blocked

blocked by #135632

Actions #58

Updated by okurz 9 months ago

  • Related to action #135632: "Mojo::File::spurt is deprecated in favor of Mojo::File::spew" breaking os-autoinst OBS build and osd-deployment size:M added
Actions #60

Updated by tinita 9 months ago

  • Status changed from Blocked to Feedback

deployed yesterday.
/var/log/openqa_scheduler
osd:

[2023-09-25T10:59:56.015244+02:00] [debug] [pid:9919] Skipping 74 jobs because of no free workers for requested worker classes (qemu_ppc64le,tap:22,hmc_ppc64le-1disk:17,qemu_ppc64le-large-mem,tap:13,qemu_x86_64,tap,worker31:8,virt-mm-64bit-ipmi:4,64bit-ipmi-large-mem:3,openqaworker16,qemu_x86_64,tap:3,64bit-ipmi-amd-zen3:1,generalhw_RPi3B:1,generalhw_RPi3B+:1,generalhw_RPi4:1)

o3:

[2023-09-25T09:00:39.060274Z] [debug] [pid:23000] Skipping 23 jobs because of no free workers for requested worker classes (qemu_ppc64le:15,qemu_aarch32,tap:6,s390x-zVM-vswitch-l2,tap:2)
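
A summary line like the ones above could be produced by counting the rejected jobs per requested worker class combination. The following Python sketch only illustrates the idea; it is not the code from the pull request, and the function name rejection_summary and the (job_id, worker_classes) input shape are assumptions.

```python
from collections import Counter


def rejection_summary(rejected_jobs):
    """Build a log message from (job_id, worker_classes) pairs of unassignable jobs."""
    # Jobs asking for the same set of classes share one key, e.g. "qemu_aarch32,tap".
    counts = Counter(",".join(sorted(classes)) for _, classes in rejected_jobs)
    # most_common() orders the combinations by descending job count.
    stats = ",".join(f"{key}:{n}" for key, n in counts.most_common())
    return (
        f"Skipping {len(rejected_jobs)} jobs because of no free workers "
        f"for requested worker classes ({stats})"
    )


jobs = (
    [(i, ["qemu_ppc64le"]) for i in range(15)]
    + [(100 + i, ["tap", "qemu_aarch32"]) for i in range(6)]
    + [(200 + i, ["tap", "s390x-zVM-vswitch-l2"]) for i in range(2)]
)
print(rejection_summary(jobs))
```

Note that the class separator and the entry separator are both commas, so the count after each colon is what delimits one entry from the next.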
Actions #61

Updated by tinita 9 months ago

  • Status changed from Feedback to In Progress
Actions #62

Updated by tinita 9 months ago

https://github.com/os-autoinst/openQA/pull/5315 Reduce runtime of t/04-scheduler.t

Actions #63

Updated by openqa_review 9 months ago

  • Due date set to 2023-10-12

Setting due date based on mean cycle time of SUSE QE Tools

Actions #64

Updated by okurz 9 months ago

Actions #65

Updated by tinita 9 months ago

  • Status changed from In Progress to Resolved
Actions #66

Updated by okurz 9 months ago

  • Due date deleted (2023-10-12)