Related to QA (public) - action #130312: [tools] URL listing TW snapshots (and the changes therein), has stopped working

Resolved

kraih

2023-06-03

Related to openQA Infrastructure (public) - action #134927: OSD throws 503, unresponsive for some minutes size:M

Resolved

2023-08-31

Related to openQA Infrastructure (public) - action #135632: "Mojo::File::spurt is deprecated in favor of Mojo::File::spew" breaking os-autoinst OBS build and osd-deployment size:M

Resolved

2023-05-08

Blocks openQA Project (public) - action #129481: Try to *reduce* number of apache workers to limit concurrent requests causing high CPU usage

Rejected

Copied from openQA Project (public) - action #129487: high response times on osd - Limit the number of concurrent job upload handling on webUI side. Can we use a semaphore or lock using the database? size:M

Rejected

Updated by okurz about 2 years ago

Copied from action #129487: high response times on osd - Limit the number of concurrent job upload handling on webUI side. Can we use a semaphore or lock using the database? size:M added

Updated by okurz about 2 years ago

Subject changed from high response times on osd - simple limit of jobs running concurrently to high response times on osd - simple limit of jobs running concurrently size:M
Description updated (diff)
Status changed from New to Workable

Updated by kraih almost 2 years ago

Jan found an Nginx feature that would probably work for limiting concurrent uploads, given the right location settings for the API endpoint. Retries of log file uploads should just work when rate limited. Asset uploads are less clear: there is a small possibility that they could get lost, so this needs to be tested.

Updated by kraih almost 2 years ago

Assignee set to kraih

Updated by kraih almost 2 years ago

Status changed from Workable to In Progress

Updated by kraih almost 2 years ago

I'll do some local experiments to figure out the right nginx settings. For production deployment we'll have to consider other factors too, though, such as logging with rotation and TLS certificates. So there will probably be a follow-up ticket for a proper nginx deployment if this approach works out for the rate limiting.

Updated by openqa_review almost 2 years ago

Due date set to 2023-06-14

Setting due date based on mean cycle time of SUSE QE Tools

Updated by livdywan almost 2 years ago

Blocks action #129481: Try to *reduce* number of apache workers to limit concurrent requests causing high CPU usage added

Updated by kraih almost 2 years ago

The nginx connection limits appear to work exactly as we expected and are pretty straightforward to configure:

# Here we create two connection counters, one for all incoming
# connections, and one per IP address
limit_conn_zone $server_name zone=all:10m;
limit_conn_zone $binary_remote_addr zone=addr:10m;

server {
    listen       8080;
    server_name  localhost;
    root /usr/share/openqa/public;
    client_max_body_size 0;

    location /api/v1/ws/ {
        proxy_pass http://[::1]:9527;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location /liveviewhandler/ {
        proxy_pass http://[::1]:9528;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location /tests {
        # Here we limit the actual number of connections
        limit_conn all 4;
        limit_conn addr 2;

        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Server $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass "http://[::1]:9526/tests";
    }

    location / {
        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Server $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass "http://[::1]:9526";
    }
}

For testing purposes I've only put limits on /tests: 4 connections overall from all IP addresses, and 2 connections per individual IP address. And this config does exactly that.
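
To illustrate how the two counters interact, here is a toy Python model of the behaviour described above. It is only a sketch: the class and method names are made up for illustration, the client addresses are arbitrary, and none of this is nginx code. A connection is admitted only if both the shared counter and the per-address counter are below their limits.

```python
from collections import Counter


class ConnLimiter:
    """Toy model of two connection counters: one shared, one per client address."""

    def __init__(self, total_limit=4, per_addr_limit=2):
        self.total_limit = total_limit
        self.per_addr_limit = per_addr_limit
        self.total = 0
        self.per_addr = Counter()

    def try_open(self, addr):
        """Return True if a new connection from addr would be admitted."""
        if self.total >= self.total_limit:
            return False  # shared limit reached
        if self.per_addr[addr] >= self.per_addr_limit:
            return False  # this client already holds its maximum
        self.total += 1
        self.per_addr[addr] += 1
        return True

    def close(self, addr):
        """Release the slot held by a finished connection from addr."""
        if self.per_addr[addr] > 0:
            self.per_addr[addr] -= 1
            self.total -= 1


limiter = ConnLimiter()
clients = ["10.0.0.1", "10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.2", "10.0.0.3"]
results = [limiter.try_open(addr) for addr in clients]
print(results)  # -> [True, True, False, True, True, False]
```

The third connection from the first client is refused by the per-address limit, and the last one is refused because four connections are already open overall. By default nginx answers such refused requests with a 503 status.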

#10

Updated by kraih almost 2 years ago

And this one should do the trick for specifically rate limiting artefact uploads:

limit_conn_zone $server_name zone=servers:10m;

server {
    listen       8080;
    server_name  localhost;
    root /usr/share/openqa/public;
    client_max_body_size 0;

    location /api/v1/ws/ {
        proxy_pass http://[::1]:9527;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location /liveviewhandler/ {
        proxy_pass http://[::1]:9528;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location ~ ^/jobs/[0-9]+/artefact$ {
        limit_conn servers 50;

        proxy_set_header X-Rate-Limit-Test "true";
        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Server $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass "http://[::1]:9526";
    }

    location / {
        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Server $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass "http://[::1]:9526";
    }
}

With the rate limiting out of the way, now it is time to figure out all the other settings we would need for live deployment on O3/OSD.
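
Since the regex location is anchored at both ends, it may be worth checking which request paths it actually catches. The following Python sketch reuses the pattern from the config above. The sample paths are made up, and whether real upload requests carry a prefix such as /api/v1 is an assumption that would need to be checked against the actual routes.

```python
import re

# Same pattern as in the regex location; for an expression this simple,
# Python's re module behaves like the PCRE engine used by nginx.
ARTEFACT_LOCATION = re.compile(r"^/jobs/[0-9]+/artefact$")


def hits_limited_location(path):
    """Return True if the request path would match the regex location."""
    return ARTEFACT_LOCATION.search(path) is not None


# Hypothetical request paths, chosen only to exercise the pattern.
samples = [
    "/jobs/42/artefact",
    "/jobs/42/artefact/extra",
    "/jobs/abc/artefact",
    "/api/v1/jobs/42/artefact",
]
matches = {path: hits_limited_location(path) for path in samples}
for path, matched in matches.items():
    print(path, matched)
```

Only the first sample matches. A path with a leading prefix falls through to the generic location and would not be limited unless the pattern is adjusted.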

#11

Updated by kraih almost 2 years ago

For starters I'm setting up Nginx on O3 in parallel to the current Apache2. Then we can switch over on a not so busy day to see if everything works. There are a whole lot of small workarounds in the Apache config, so there will probably be some minor breakage, but nothing that we can't fix quickly.

#12

Updated by kraih almost 2 years ago

Nginx is now running on O3 on port 8080 and should work as an Apache replacement. The only remaining question is how we go about testing it in production without causing too serious disruptions.

#13

Updated by kraih almost 2 years ago

I've set it up so starting and stopping the apache2 and nginx services should be enough to switch between the two reverse proxies.

$ cat /etc/nginx/conf.d/openqa.conf
server {
    listen      80;
    server_name openqa.opensuse.org openqa.infra.opensuse.org;

    root /usr/share/openqa/public;

    client_max_body_size 0;
    client_body_buffer_size 64k;
    client_header_buffer_size 4k;

    location /nginx_status {
        stub_status;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }

    location /assets {
        alias /var/lib/openqa/share/factory;
        tcp_nopush         on;
        sendfile           on;
        sendfile_max_chunk 1m;
    }

    location /image {
        alias /var/lib/openqa/images;
        tcp_nopush         on;
        sendfile           on;
        sendfile_max_chunk 1m;
    }

    location /api/v1/ws/ {
        proxy_pass http://127.0.0.1:9527;
        proxy_http_version 1.1;
        proxy_read_timeout 3600;
        proxy_send_timeout 3600;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location /liveviewhandler/ {
        proxy_pass http://127.0.0.1:9528;
        proxy_http_version 1.1;
        proxy_read_timeout 3600;
        proxy_send_timeout 3600;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location / {
        proxy_pass "http://127.0.0.1:9526";
        tcp_nodelay        on;
        proxy_read_timeout 900;
        proxy_send_timeout 900;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Server $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto "https";
    }

    access_log /space/logs/nginx/openqa.access_log;
    error_log /space/logs/nginx/openqa.error_log;
}

Edit: Some more tuning was required on O3. Enabling sendfile for asset downloads turned out to be extremely important. Without it the whole webui felt sluggish, because downloads were slowing everything else down.

#14

Updated by kraih almost 2 years ago

I've also set up logrotate based on the apache config.

/space/logs/nginx/openqa.access_log /space/logs/nginx/access_log {
    compress
    dateext
    delaycompress
    maxage 365
    rotate 10
    size=+4096k
    notifempty
    missingok
    create 644 root root
    sharedscripts
    postrotate
     systemctl reload nginx.service
     sleep 60
    endscript
}

/space/logs/nginx/openqa.error_log /space/logs/nginx/error_log {
    compress
    dateext
    delaycompress
    maxage 365
    rotate 10
    size=+4096k
    notifempty
    missingok
    create 644 root root
    sharedscripts
    postrotate
     systemctl reload nginx.service
     sleep 60
    endscript
}

#15

Updated by okurz almost 2 years ago

Related to action #129490: high response times on osd - Try nginx on o3 with enabled load limiting or load balancing features added

#16

Updated by okurz almost 2 years ago

Related to action #130312: [tools] URL listing TW snapshots (and the changes therein), has stopped working added

#17

Updated by okurz almost 2 years ago

Sounds great! Though I suggest to focus on nginx related work in #129490 and focus this ticket on a limit in pure openQA. In the end we might not want to continue this feature work in openQA if nginx helps us already to achieve the same as you have already found out

#18

Updated by livdywan almost 2 years ago

okurz wrote:

Sounds great! Though I suggest to focus on nginx related work in #129490 and focus this ticket on a limit in pure openQA. In the end we might not want to continue this feature work in openQA if nginx helps us already to achieve the same as you have already found out

That is exactly what we discussed and what Sebastian is doing here. It makes little sense to do it the other way around.

#19

Updated by okurz almost 2 years ago

cdywan wrote:

That is exactly what we discussed and what Sebastian is doing here. It makes little sense to do it the other way around.

I don't see that it makes little sense to do what this ticket suggests. But in any case please make the work explicit and transparent by using #129490 for nginx related work

#20

Updated by livdywan almost 2 years ago

okurz wrote:

cdywan wrote:

That is exactly what we discussed and what Sebastian is doing here. It makes little sense to do it the other way around.

I don't see that it makes little sense to do what this ticket suggests. But in any case please make the work explicit and transparent by using #129490 for nginx related work

I was talking about looking into nginx first ;-) Not sure how the tickets got mixed up again. We also had some confusion due to #129487. It seems we need to take better care of tickets with overlapping goals.

#21

Updated by livdywan almost 2 years ago

Subject changed from high response times on osd - simple limit of jobs running concurrently size:M to high response times on osd - simple limit of jobs running concurrently in openQA size:M
Status changed from In Progress to Blocked

Let's swap the tickets accordingly. #129490 is the one to be evaluated first, hence blocking on that.

#22

Updated by kraih almost 2 years ago

okurz wrote:

In the end we might not want to continue this feature work in openQA if nginx helps us already to achieve the same as you have already found out

To sum up the situation so far: Nginx performs very well in production on O3 (with the right tuning), and it does have all the rate/connection limiting features we need for fine grained control over worker uploads. So it would make sense to rephrase the ACs of this ticket and to block it on #129490 as a prerequisite for connection limiting.

#23

Updated by okurz almost 2 years ago

kraih wrote:

So it would make sense to rephrase the ACs of this ticket and to block it on #129490 as a prerequisite for connection limiting.

I would rather put it like this: If we manage to use nginx to do everything we want to ensure that no openQA instance is overloaded, then we don't need this openQA feature at all. It might still be slightly beneficial in non-nginx scenarios, if there are any useful ones.

#24

Updated by okurz almost 2 years ago

Due date deleted (~~2023-06-14~~)

Discussed in weekly unblock 2023-06-07 and we will just wait for the nginx results before re-evaluating.

#25

Updated by okurz almost 2 years ago

first #131024

#26

Updated by livdywan almost 2 years ago

Subject changed from high response times on osd - simple limit of jobs running concurrently in openQA size:M to high response times on osd - simple limit of jobs running concurrently in openQA
Status changed from Blocked to New
Assignee deleted (~~kraih~~)

okurz wrote:

first #131024

The blocker was resolved. I'm resetting the ticket so we can re-evaluate what we want here

#27

Updated by okurz almost 2 years ago

Subject changed from high response times on osd - simple limit of jobs running concurrently in openQA to high response times on osd - simple limit of jobs running concurrently in openQA size:M
Description updated (diff)
Status changed from New to Workable

#28

Updated by tinita almost 2 years ago

Status changed from Workable to In Progress
Assignee set to tinita

#29

Updated by tinita almost 2 years ago

https://github.com/os-autoinst/openQA/pull/5276 Limit number of running jobs per webui instance

#30

Updated by openqa_review almost 2 years ago

Due date set to 2023-08-23

Setting due date based on mean cycle time of SUSE QE Tools

#31

Updated by tinita almost 2 years ago

https://github.com/os-autoinst/openQA/pull/5276 merged.
Another suggestion was to add a notice to the "All tests" page if the number of running jobs is currently at the limit, so instead of the normal "56 jobs are running" users would see something like "250 jobs are running (limit of running jobs reached)"

#32

Updated by okurz almost 2 years ago

tinita wrote:

Another suggestion was to add a notice to the "All tests" page if the number of running jobs is currently at the limit, so instead of the normal "56 jobs are running" users would see something like "250 jobs are running (limit of running jobs reached)"

How about
"250 jobs are running (limited by server config)"

#33

Updated by tinita almost 2 years ago

https://github.com/os-autoinst/openQA/pull/5279 Show max running jobs on /tests page

#34

Updated by tinita almost 2 years ago

So I investigated the problem with max_running_jobs not always being set to the default value, causing warnings.
That is actually a problem of the tests, and it wouldn't happen in production code.
The problem is that apparently OpenQA::Setup::read_config is not always called when we create app instances in tests.

I think that's unfortunate because we have to cover for that in code by ensuring a default value in several places, although that should be done by OpenQA::Setup in just one place.
It would be good to make sure in tests we read the config and fill in defaults.
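
The idea of filling in defaults in exactly one place can be sketched in a few lines of Python. This is not the OpenQA::Setup code: the function names are made up, and both the section name and the default value -1 are placeholders chosen for illustration. Note that this toy version only keeps keys that have a default.

```python
# Hypothetical defaults table; section name and value are placeholders.
DEFAULTS = {"scheduler": {"max_running_jobs": -1}}


def read_config(raw):
    """Return a config in which every known key is present.

    Values from raw win; anything missing is taken from DEFAULTS.
    """
    config = {}
    for section, keys in DEFAULTS.items():
        provided = raw.get(section, {})
        config[section] = {key: provided.get(key, default) for key, default in keys.items()}
    return config


def max_running_jobs(config):
    # No fallback needed here: read_config already guaranteed the key exists.
    return config["scheduler"]["max_running_jobs"]


empty = read_config({})
custom = read_config({"scheduler": {"max_running_jobs": 250}})
print(max_running_jobs(empty), max_running_jobs(custom))
```

If every app instance, including the ones created in tests, goes through such a function, consumers never have to repeat the default themselves.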

I will create a ticket because I had a quick look and it doesn't look like something that can be fixed quickly: #134114

#35

Updated by tinita almost 2 years ago

Status changed from In Progress to Feedback

I set max_running_jobs to 250 on osd and will monitor the job queue and CPU Load

#36

Updated by tinita almost 2 years ago

Status changed from Feedback to In Progress

Somehow the limit had no effect. Will test locally...

#37

Updated by tinita almost 2 years ago

Status changed from In Progress to Feedback

https://github.com/os-autoinst/openQA/pull/5285 Fix scheduler getting max_running_jobs config

So we have to use OpenQA::App->singleton->config. I had used $self->{config} because I found that code in the same module already.

This also explains some of the problems with the defaults.

I will look into that as part of #134114

#38

Updated by tinita almost 2 years ago

https://github.com/os-autoinst/openQA/pull/5285 merged and deployed on osd, so limit should now be effective.

https://github.com/os-autoinst/openQA/pull/5287 Remove defaults, should be ensured by OpenQA::Setup already - merged, deployed on o3

Now monitoring what's happening

#39

Updated by tinita almost 2 years ago

File max_running_jobs_grafana_job_queue.png added
Due date changed from 2023-08-23 to 2023-08-30

As discussed, the current implementation was too naive; I was just assuming that the schedule function would only get one job (or a group of jobs depending on each other) as a parameter.

What is happening now is the following:
https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=9&from=1692746389833&to=1692749719627

As soon as $running (the number of jobs in an EXECUTION state) is below the $limit, the schedule function will assign all currently scheduled jobs (up to a maximum of MAX_JOB_ALLOCATION, 80 by default), limited by free workers of course.
Then it will no longer schedule until $running is below $limit again.

The current logic is not what was intended.
It is not that bad though, because the newly assigned jobs (let's say 80) are just starting, and then one job after the other will finish, while no new jobs are assigned. That means the number of jobs in the phase of uploading at the same time will not be that high.

We could say that together with the MAX_JOB_ALLOCATION limit we are probably fine, however, we might want max_running_jobs to be a hard limit (with the exception of parallel clusters that should be assigned together).
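
The overshoot described above can be reproduced with a small Python sketch. It is not the scheduler code: the function name and the sample numbers are made up, and only the value 80 for MAX_JOB_ALLOCATION is taken from this note.

```python
MAX_JOB_ALLOCATION = 80  # default batch size mentioned in this note


def naive_allocate(running, limit, scheduled, free_workers):
    """Number of jobs assigned in one scheduler run under the naive check.

    The limit is only used as an on/off switch, not as a budget.
    """
    if running >= limit:
        return 0
    return min(scheduled, free_workers, MAX_JOB_ALLOCATION)


# Hypothetical situation: one slot below the limit, plenty of work and workers.
assigned = naive_allocate(running=249, limit=250, scheduled=500, free_workers=100)
print(assigned, 249 + assigned)  # -> 80 329
```

With these numbers a single run pushes the number of running jobs from 249 to 329, and nothing further is assigned until it drops below 250 again.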

#40

Updated by tinita almost 2 years ago

Status changed from Feedback to In Progress

We decided now that we want max_running_jobs as a hard limit. The current combination of max_running_jobs and MAX_JOB_ALLOCATION might actually be the best usage of resources, but it's harder to explain to users what exactly is happening. (MAX_JOB_ALLOCATION would also have to become a configuration variable to make it easily adjustable, so that would require a change as well.)
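
One way to picture a hard limit that still keeps parallel clusters together is the following Python sketch. It is an illustration under assumptions, not the code from the pull request: the function name is made up, jobs are represented as plain IDs grouped into lists, and a cluster that starts while there is still room is assumed to be assigned as a whole even if that exceeds the limit.

```python
def allocate_with_hard_limit(running, limit, clusters, free_workers):
    """Pick jobs to assign without starting new work once the limit is reached.

    clusters is a list of lists of job IDs; each inner list must be
    assigned together (a single job is a cluster of size one).
    """
    assigned = []
    for cluster in clusters:
        if running + len(assigned) >= limit:
            break  # limit reached, stop assigning
        if len(cluster) > free_workers - len(assigned):
            continue  # not enough free workers for this cluster, try the next
        assigned.extend(cluster)
    return assigned


# Two free slots, independent jobs only: exactly two jobs are assigned.
singles = allocate_with_hard_limit(248, 250, [[1], [2], [3]], free_workers=10)
# Two free slots, second entry is a parallel cluster of three jobs.
with_cluster = allocate_with_hard_limit(248, 250, [[1], [2, 3, 4], [5]], free_workers=10)
print(singles, with_cluster)  # -> [1, 2] [1, 2, 3, 4]
```

In the second call the cluster is started while one slot is still free, so the total ends up at 252. That is the only case in which this sketch goes over the limit.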

#41

Updated by tinita over 1 year ago

Due date changed from 2023-08-30 to 2023-09-06

Due to putting out fires and other important tasks - bumping due date

#42

Updated by tinita over 1 year ago

Draft https://github.com/os-autoinst/openQA/pull/5289

#43

Updated by tinita over 1 year ago

Status changed from In Progress to Workable

I'm unassigning myself.
No programming has been possible for the last two weeks, only putting out fires, and it seems that will go on.

#44

Updated by tinita over 1 year ago

I was wondering why the schedule method takes two additional parameters, as I can't find any code under lib that passes these:

sub schedule ($self, $allocated_workers = {}, $allocated_jobs = {}) {

I found https://github.com/os-autoinst/openQA/pull/3741/files and it looks like the parameters were added to be able to check the contents of those hashes (or actually only $allocated_workers) in a unit test.

Since I'm extracting _allocate_jobs this might not be necessary anymore, and I can remove those parameters.

#45

Updated by tinita over 1 year ago

Status changed from Workable to In Progress

#46

Updated by tinita over 1 year ago

Due date changed from 2023-09-06 to 2023-09-13

#47

Updated by okurz over 1 year ago

Related to action #134927: OSD throws 503, unresponsive for some minutes size:M added

#48

Updated by tinita over 1 year ago

Still working on the test.
One problem is that 04-scheduler.t so far only had one worker, so there are no tests yet that deal with more than one job assignment.

#49

Updated by tinita over 1 year ago

Ready for review: https://github.com/os-autoinst/openQA/pull/5289 Make max_running_jobs a hard limit

#50

Updated by tinita over 1 year ago

Status changed from In Progress to Feedback

#51

Updated by okurz over 1 year ago

PR approved. As mentioned in https://github.com/os-autoinst/openQA/pull/5289#issuecomment-1713488199 please look into the runtime improvement of the unit test and further refactoring of the code as discussed.

#52

Updated by livdywan over 1 year ago

okurz wrote in #note-51:

PR approved. As mentioned in https://github.com/os-autoinst/openQA/pull/5289#issuecomment-1713488199 please look into the runtime improvement of the unit test and further refactoring of the code as discussed.

I would lean towards a follow-up ticket here as this has already been in the queue for a bit and probably won't be done today... of course we can discuss this in the Unblock.

#53

Updated by okurz over 1 year ago

Status changed from Feedback to In Progress

livdywan wrote in #note-52:

okurz wrote in #note-51:

PR approved. As mentioned in https://github.com/os-autoinst/openQA/pull/5289#issuecomment-1713488199 please look into the runtime improvement of the unit test and further refactoring of the code as discussed.

I would lean towards a follow-up ticket here as this has already been in the queue for a bit and probably won't be done today... of course we can discuss this in the Unblock.

No, refactoring and cleanup with no user-facing benefit must be part of the original work. There is no better time for refactoring than now. Of course as a team we should strive to support each other as much as possible to bring this to conclusion as soon as possible.

#54

Updated by okurz over 1 year ago

Due date changed from 2023-09-13 to 2023-09-29
Priority changed from High to Normal

The feature is effective in production and we can follow up with the unit test improvements in the following days. Bumping the due date due to unforeseen distractions I needed to put on the shoulders of the team over the past days.

#55

Updated by tinita over 1 year ago

https://github.com/os-autoinst/openQA/pull/5306 scheduler: Log statistics of rejected jobs (#135578)

#56

Updated by tinita over 1 year ago

Status changed from In Progress to Feedback

#57

Updated by okurz over 1 year ago

Due date deleted (~~2023-09-29~~)
Status changed from Feedback to Blocked

blocked by #135632

#58

Updated by okurz over 1 year ago

Related to action #135632: "Mojo::File::spurt is deprecated in favor of Mojo::File::spew" breaking os-autoinst OBS build and osd-deployment size:M added

#59

Updated by tinita over 1 year ago

https://github.com/os-autoinst/openQA/pull/5306 merged

#60

Updated by tinita over 1 year ago

Status changed from Blocked to Feedback

Deployed yesterday. Example entries from /var/log/openqa_scheduler:
osd:

[2023-09-25T10:59:56.015244+02:00] [debug] [pid:9919] Skipping 74 jobs because of no free workers for requested worker classes (qemu_ppc64le,tap:22,hmc_ppc64le-1disk:17,qemu_ppc64le-large-mem,tap:13,qemu_x86_64,tap,worker31:8,virt-mm-64bit-ipmi:4,64bit-ipmi-large-mem:3,openqaworker16,qemu_x86_64,tap:3,64bit-ipmi-amd-zen3:1,generalhw_RPi3B:1,generalhw_RPi3B+:1,generalhw_RPi4:1)

o3:

[2023-09-25T09:00:39.060274Z] [debug] [pid:23000] Skipping 23 jobs because of no free workers for requested worker classes (qemu_ppc64le:15,qemu_aarch32,tap:6,s390x-zVM-vswitch-l2,tap:2)
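
For readers of these entries: each item in the parentheses is a comma-joined set of worker classes followed by a colon and the number of skipped jobs that requested it. The Python sketch below rebuilds the o3 message from per-job data to make that shape explicit. It is not the scheduler implementation; the function name is made up, the input list is reconstructed from the log line above, and sorting the class names within a job is an assumption.

```python
from collections import Counter


def format_skipped(jobs_worker_classes):
    """Build a summary line from one list of worker classes per skipped job."""
    # Jobs requesting the same set of classes are counted together.
    counts = Counter(",".join(sorted(classes)) for classes in jobs_worker_classes)
    total = sum(counts.values())
    # Most frequently requested combinations come first.
    details = ",".join(f"{name}:{count}" for name, count in counts.most_common())
    return (
        f"Skipping {total} jobs because of no free workers "
        f"for requested worker classes ({details})"
    )


# Input reconstructed from the o3 entry: 15 + 6 + 2 skipped jobs.
skipped = (
    [["qemu_ppc64le"]] * 15
    + [["qemu_aarch32", "tap"]] * 6
    + [["s390x-zVM-vswitch-l2", "tap"]] * 2
)
message = format_skipped(skipped)
print(message)
```

Because class names are themselves joined with commas, the per-combination count is always the number directly after the colon, and the counts add up to the total at the start of the message.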

#61

Updated by tinita over 1 year ago

Status changed from Feedback to In Progress

#62

Updated by tinita over 1 year ago

https://github.com/os-autoinst/openQA/pull/5315 Reduce runtime of t/04-scheduler.t

#63

Updated by openqa_review over 1 year ago

Due date set to 2023-10-12

Setting due date based on mean cycle time of SUSE QE Tools

#64

Updated by okurz over 1 year ago

https://github.com/os-autoinst/openQA/pull/5315 merged, what's next/missing?

#65

Updated by tinita over 1 year ago

Status changed from In Progress to Resolved
