action #129490 (closed)

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

coordination #108209: [epic] Reduce load on OSD

high response times on osd - Try nginx on o3 with enabled load limiting or load balancing features

Added by okurz over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Feature requests
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Motivation

Apache in prefork mode uses a lot of resources to provide mediocre performance.

Acceptance criteria

  • AC1: It is known whether Nginx rate limiting features work for our use cases (see the illustrative sketch below)
  • AC2: Nginx has been deployed successfully on O3
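
As a hedged illustration of what AC1 refers to: nginx load limiting could combine a per-client request rate limit with a cap on concurrent upstream connections. This is a sketch only; the zone name, rate and burst values are placeholders, and only the max_conns cap reflects the config that was actually deployed later in this ticket.

# Illustrative sketch, not a tested recommendation
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

upstream webui {
    zone upstream_webui 64k;
    # cap concurrent connections to the app server
    server 127.0.0.1:9526 max_conns=30;
}

server {
    listen 80;

    location /api/ {
        # allow short bursts, reject excess requests early
        limit_req zone=api_limit burst=20 nodelay;
        proxy_pass http://webui;
    }
}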

Suggestions

  • See #129619 for results from previous experiments

Files

  • processes.png (172 KB), kraih, 2023-06-06 12:16
  • memory.png (345 KB), kraih, 2023-06-06 12:16

Related issues (7: 0 open, 7 closed)

  • Related to QA - action #130312: [tools] URL listing TW snapshots (and the changes therein), has stopped working (Resolved, kraih, 2023-06-03)
  • Related to openQA Project - action #129619: high response times on osd - simple limit of jobs running concurrently in openQA size:M (Resolved, tinita, 2023-05-20)
  • Related to openQA Project - action #130477: [O3] http connection to O3 repo is broken sporadically in virtualization tests, likely due to systemd dependencies on apache/nginx size:M (Resolved, mkittler, 2023-06-07)
  • Related to openQA Project - action #132167: asset uploading failed with http status 502 size:M (Resolved, kraih, 2023-06-29, 2023-08-30)
  • Copied from openQA Project - action #129487: high response times on osd - Limit the number of concurrent job upload handling on webUI side. Can we use a semaphore or lock using the database? size:M (Rejected, okurz)
  • Copied to openQA Infrastructure - action #129493: high response times on osd - better nice level for velociraptor (Resolved, okurz)
  • Copied to openQA Project - action #130636: high response times on osd - Try nginx on OSD size:S (Resolved, mkittler)
Actions #1

Updated by okurz over 1 year ago

  • Copied from action #129487: high response times on osd - Limit the number of concurrent job upload handling on webUI side. Can we use a semaphore or lock using the database? size:M added
Actions #2

Updated by okurz over 1 year ago

  • Copied to action #129493: high response times on osd - better nice level for velociraptor added
Actions #3

Updated by okurz over 1 year ago

  • Target version changed from Ready to future
Actions #4

Updated by okurz over 1 year ago

  • Related to action #130312: [tools] URL listing TW snapshots (and the changes therein), has stopped working added
Actions #5

Updated by okurz over 1 year ago

  • Related to action #129619: high response times on osd - simple limit of jobs running concurrently in openQA size:M added
Actions #6

Updated by kraih over 1 year ago

  • Assignee set to kraih

After the work on #129619, O3 now uses nginx in production. We've learned a lot about how to tune the configuration for openQA in the process. A similar config should work very well for OSD too. The SSL setup and Salt will be a slight complication, however.

Actions #7

Updated by livdywan over 1 year ago

  • Status changed from New to In Progress
  • Target version changed from future to Ready

So just for clarity, this is where the nginx work currently stands; see #129619#note-19

Actions #8

Updated by kraih over 1 year ago

I will start preparing a config for OSD now, the same way as I did for O3. Nginx will be deployed in parallel to Apache2, running on ports 8080 and 4433. Once I'm confident that the config works, I will wait for a slow day and then do a switch where I stop Apache and run Nginx in its place. Finally, once all issues have been resolved, Apache will be replaced with Nginx in Salt (I will probably need help there though). I don't expect this to go fast, since I want to be very careful about avoiding downtime.
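
A minimal sketch of what such a parallel deployment could look like, assuming the OSD hostname openqa.suse.de and the default openQA webUI port; ports 8080 and 4433 are the ones mentioned above, everything else is illustrative:

# Sketch only: nginx listening next to Apache during the trial period.
# Apache keeps 80/443, nginx takes the spare ports for testing.
upstream webui {
    server 127.0.0.1:9526;
}

server {
    listen      8080;          # plain HTTP test port
    server_name openqa.suse.de;

    location / {
        proxy_pass http://webui;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}

# A second server block on port 4433 would additionally need "listen 4433 ssl;"
# plus the certificate and key already used by Apache.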

Actions #9

Updated by kraih over 1 year ago

  • Description updated (diff)
Actions #10

Updated by kraih over 1 year ago

Opened a PR with improvements for the default Nginx config from the O3 deployment: https://github.com/os-autoinst/openQA/pull/5185

Actions #11

Updated by okurz over 1 year ago

kraih wrote:

I will start preparing a config for OSD now, the same way as I did for O3. Nginx will be deployed in parallel to Apache2, running on ports 8080 and 4433. Once I'm confident that the config works, I will wait for a slow day and then do a switch where I stop Apache and run Nginx in its place. Finally, once all issues have been resolved, Apache will be replaced with Nginx in Salt (I will probably need help there though). I don't expect this to go fast, since I want to be very careful about avoiding downtime.

For OSD I suggest preparing changes in https://gitlab.suse.de/openqa/salt-states-openqa rather than applying any manual changes. For applying the actual changes I suggest looking into https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/server.sls, searching for "apache" and applying the corresponding changes for nginx. Having the nginx config files come from the openQA upstream repo should make that straightforward.

Actions #12

Updated by okurz over 1 year ago

Can I blame the switch to nginx for https://github.com/os-autoinst/os-autoinst-distri-openQA/actions/runs/5178791698/jobs/9330745528?pr=121#step:5:240 reporting "502 Bad Gateway"? I doubt it's just a coincidence :)

Actions #13

Updated by okurz over 1 year ago

  • Priority changed from Normal to Urgent

https://openqa.opensuse.org/tests/3338256 says "Reason: api failure: Connection error: Connection refused". I expect that many more openQA tests fail for the same or a related issue. How about we switch back to apache for the time being to test more during normal working hours?

EDIT: Interestingly https://openqa.opensuse.org/tests?resultfilter=Incomplete only showed the one job https://openqa.opensuse.org/tests/3338256 failing for the same reason, currently no other.

Actions #14

Updated by openqa_review over 1 year ago

  • Due date set to 2023-06-20

Setting due date based on mean cycle time of SUSE QE Tools

Actions #15

Updated by livdywan over 1 year ago

okurz wrote:

EDIT: Interestingly https://openqa.opensuse.org/tests?resultfilter=Incomplete only showed the one job https://openqa.opensuse.org/tests/3338256 failing for the same reason, currently no other.

Seems like that was a one-off. The same job is now looking fine and I also couldn't find other instances of the API failure.

Actions #16

Updated by kraih over 1 year ago

okurz wrote:

Can I blame the switch to nginx for https://github.com/os-autoinst/os-autoinst-distri-openQA/actions/runs/5178791698/jobs/9330745528?pr=121#step:5:240 reporting "502 Bad Gateway"? I doubt it's just a coincidence :)

Yes, the problem was that nginx can handle far more connections than Apache or our app server, and it happily bombarded the app server with way too many requests during peak times. The adjusted config with an upstream max_conns limit appears to have taken care of the problem:

# The "max_conns" value should be identical to the maximum number of
# connections the webui is configured to handle concurrently
upstream webui {
    zone upstream_webui 64k;
    server 127.0.0.1:9526 max_conns=30;
}

upstream websocket {
    server 127.0.0.1:9527;
}

upstream livehandler {
    server 127.0.0.1:9528;
}

server {
    listen      80;
    server_name openqa.opensuse.org openqa.infra.opensuse.org;

    root /usr/share/openqa/public;

    client_max_body_size 0;
    client_body_buffer_size 64k;

    location /assets {
        alias /var/lib/openqa/share/factory;
        tcp_nopush         on;
        sendfile           on;
        sendfile_max_chunk 1m;
    }

    location /image {
        alias /var/lib/openqa/images;
        tcp_nopush         on;
        sendfile           on;
        sendfile_max_chunk 1m;
    }

    location /api/v1/ws/ {
        proxy_pass http://websocket;
        proxy_http_version 1.1;
        proxy_read_timeout 3600;
        proxy_send_timeout 3600;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location /liveviewhandler/ {
        proxy_pass http://livehandler;
        proxy_http_version 1.1;
        proxy_read_timeout 3600;
        proxy_send_timeout 3600;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location / {
        proxy_pass "http://webui";
        tcp_nodelay        on;
        proxy_read_timeout 900;
        proxy_send_timeout 900;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Server $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto "https";
    }

    access_log /space/logs/nginx/openqa.access_log;
    error_log /space/logs/nginx/openqa.error_log;
}
Actions #17

Updated by kraih over 1 year ago

okurz wrote:

For OSD I suggest preparing changes in https://gitlab.suse.de/openqa/salt-states-openqa rather than applying any manual changes. For applying the actual changes I suggest looking into https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/server.sls, searching for "apache" and applying the corresponding changes for nginx. Having the nginx config files come from the openQA upstream repo should make that straightforward.

Not sure I can work it out that way. I can try to make the openQA upstream Nginx config more reusable, but I'll probably not be able to figure out SSL without testing on the actual machine.
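
For reference, the part that needs on-machine testing would be a TLS server block along these lines; this is only a sketch, and the certificate paths are placeholders rather than the actual OSD paths:

# Sketch only: TLS termination in nginx, certificate paths are placeholders
server {
    listen              443 ssl;
    server_name         openqa.suse.de;

    ssl_certificate     /etc/ssl/placeholder/openqa.crt;
    ssl_certificate_key /etc/ssl/placeholder/openqa.key;

    location / {
        # "webui" refers to the upstream block defined as in the O3 config
        proxy_pass http://webui;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto "https";
    }
}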

Actions #18

Updated by kraih over 1 year ago

Munin graphs show a visible reduction in memory usage since Nginx has been deployed on O3.

Actions #19

Updated by kraih over 1 year ago

Opened a PR to update the default nginx config again: https://github.com/os-autoinst/openQA/pull/5191

Actions #20

Updated by okurz over 1 year ago

  • Related to action #130477: [O3]http connection to O3 repo is broken sporadically in virtualization tests, likely due to systemd dependencies on apache/nginx size:M added
Actions #21

Updated by kraih over 1 year ago

I've checked the O3 logs again today for unexpected 502 errors. There were two larger batches, one at 07:44 and one at 11:44, but each corresponded to a restart of openqa-webui, so nothing to worry about. Looks like everything is still working as expected.

Actions #22

Updated by kraih over 1 year ago

Today I've lowered the error_log level to warn, which revealed another setting that can be optimised:

2023/06/08 12:12:18 [warn] 19902#19902: *8368071 a client request body is buffered to a temporary file /var/lib/nginx/tmp//0000003363, client: 192.168.112.17, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345916/status HTTP/1.1", host: "openqa1-opensuse"
2023/06/08 12:12:19 [warn] 19902#19902: *8368086 a client request body is buffered to a temporary file /var/lib/nginx/tmp//0000003364, client: 192.168.112.11, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345811/artefact HTTP/1.1", host: "openqa1-opensuse"
2023/06/08 12:12:19 [warn] 19902#19902: *8368099 a client request body is buffered to a temporary file /var/lib/nginx/tmp//0000003365, client: 192.168.112.11, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345811/artefact HTTP/1.1", host: "openqa1-opensuse"
2023/06/08 12:12:19 [warn] 19902#19902: *8368121 a client request body is buffered to a temporary file /var/lib/nginx/tmp//0000003366, client: 192.168.112.11, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345811/artefact HTTP/1.1", host: "openqa1-opensuse"
2023/06/08 12:12:19 [warn] 20748#20748: *8368149 a client request body is buffered to a temporary file /var/lib/nginx/tmp//0000003367, client: 192.168.112.11, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345811/artefact HTTP/1.1", host: "openqa1-opensuse"
2023/06/08 12:12:20 [warn] 20748#20748: *8368205 a client request body is buffered to a temporary file /var/lib/nginx/tmp//0000003368, client: 192.168.112.11, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345811/artefact HTTP/1.1", host: "openqa1-opensuse"
2023/06/08 12:12:20 [warn] 20748#20748: *8368226 a client request body is buffered to a temporary file /var/lib/nginx/tmp//0000003369, client: 192.168.112.11, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345811/artefact HTTP/1.1", host: "openqa1-opensuse"
2023/06/08 12:13:15 [warn] 20748#20748: *8375999 a client request body is buffered to a temporary file /var/lib/nginx/tmp//0000003370, client: 192.168.112.17, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345855/artefact HTTP/1.1", host: "openqa1-opensuse"
2023/06/08 12:13:33 [warn] 20746#20746: *8378374 a client request body is buffered to a temporary file /var/lib/nginx/tmp//0000003371, client: 192.168.112.18, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345643/artefact HTTP/1.1", host: "openqa1-opensuse"

This happens with a very high frequency because we've only set client_body_buffer_size 64k;, which is too low to cover the 1 MB chunks of artefact uploads we are usually dealing with. As a solution I've now configured client_body_buffer_size 2m;. It won't completely eliminate buffering to files, since we still have larger uploads, but it should be enough to avoid buffering for the vast majority of uploads.
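
In the server block this amounts to the following, matching the final config shown later in the ticket:

    # Keep request bodies up to 2 MB in memory so the usual 1 MB artefact
    # upload chunks from the workers are not spooled to a temporary file
    client_max_body_size 0;
    client_body_buffer_size 2m;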

Actions #23

Updated by kraih over 1 year ago

I also experimented with proxy_request_buffering off, which completely disables upload buffering to files in Nginx. But that results in the app server getting overwhelmed in some cases:

2023/06/08 13:17:19 [error] 4625#4625: *309604 no live upstreams while connecting to upstream, client: 192.168.112.17, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345784/status HTTP/1.1", upstream: "http://webui/api/v1/jobs/3345784/status", host: "openqa1-opensuse"
192.168.112.17 - - [08/Jun/2023:13:17:19 +0000] "POST /api/v1/jobs/3345784/status HTTP/1.1" 502 157 "-" "Mojolicious (Perl)" rt=0.000 urt="0.000"

I find this quite interesting, because with buffering enabled we had not seen this before. That suggests the buffering does a pretty good job of protecting the app server and that we will be better off with client_body_buffer_size 2m; in production.
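
For reference, the rejected experiment amounted to something like this (sketch only; the directive exists in nginx but was not kept in the final config):

    location / {
        proxy_pass http://webui;
        # Stream request bodies straight to the app server instead of
        # buffering them first; reverted because it overwhelmed the upstream
        proxy_request_buffering off;
    }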

Actions #24

Updated by kraih over 1 year ago

Opened a PR for the buffer settings: https://github.com/os-autoinst/openQA/pull/5196

Actions #25

Updated by livdywan over 1 year ago

  • Subject changed from high response times on osd - Try nginx with enabled load limiting or load balancing features to high response times on osd - Try nginx on o3 with enabled load limiting or load balancing features
  • Description updated (diff)
Actions #26

Updated by livdywan over 1 year ago

  • Copied to action #130636: high response times on osd - Try nginx on OSD size:S added
Actions #27

Updated by kraih over 1 year ago

  • Status changed from In Progress to Feedback

With the OSD part of this ticket moved to #130636, I think we can transition to feedback. I have nothing else I want to experiment with. Here's the config we ended up with:

# The "max_conns" value should be identical to the maximum number of
# connections the webui is configured to handle concurrently
upstream webui {
    zone upstream_webui 64k;
    server 127.0.0.1:9526 max_conns=30;
}

upstream websocket {
    server 127.0.0.1:9527;
}

upstream livehandler {
    server 127.0.0.1:9528;
}

# Standard log format with added time
log_format with_time '$remote_addr - $remote_user [$time_local] '
                     '"$request" $status $body_bytes_sent '
                     '"$http_referer" "$http_user_agent" '
                     'rt=$request_time urt="$upstream_response_time"';

server {
    listen      80;
    server_name openqa.opensuse.org openqa.infra.opensuse.org;

    root /usr/share/openqa/public;

    client_max_body_size 0;

    # The "client_body_buffer_size" value should usually be larger
    # than the UPLOAD_CHUNK_SIZE used by openQA workers, so there is
    # no excessive buffering to disk
    client_body_buffer_size 2m;

    location /assets {
        alias /var/lib/openqa/share/factory;
        tcp_nopush         on;
        sendfile           on;
        sendfile_max_chunk 1m;
    }

    location /image {
        alias /var/lib/openqa/images;
        tcp_nopush         on;
        sendfile           on;
        sendfile_max_chunk 1m;
    }

    location /api/v1/ws/ {
        proxy_pass http://websocket;
        proxy_http_version 1.1;
        proxy_read_timeout 3600;
        proxy_send_timeout 3600;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location /liveviewhandler/ {
        proxy_pass http://livehandler;
        proxy_http_version 1.1;
        proxy_read_timeout 3600;
        proxy_send_timeout 3600;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location / {
        proxy_pass "http://webui";
        tcp_nodelay        on;
        proxy_read_timeout 900;
        proxy_send_timeout 900;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Server $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto "https";
    }

    # O3 specific tmp paths
    client_body_temp_path /space/tmp/nginx/upload_temp;
    proxy_temp_path /space/tmp/nginx/proxy_temp;

    # O3 specific log paths
    access_log /space/logs/nginx/openqa.access_log with_time;
    error_log /space/logs/nginx/openqa.error_log;
}
Actions #28

Updated by kraih over 1 year ago

  • Status changed from Feedback to Resolved

Looks like we have a pretty reliable setup on O3 now.

Actions #29

Updated by okurz over 1 year ago

  • Due date deleted (2023-06-20)
Actions #30

Updated by okurz over 1 year ago

  • Related to action #132167: asset uploading failed with http status 502 size:M added