action #129490
closed coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
coordination #108209: [epic] Reduce load on OSD
high response times on osd - Try nginx on o3 with enabled load limiting or load balancing features
Added by okurz over 1 year ago. Updated over 1 year ago.
Description
Files
processes.png (172 KB) | kraih, 2023-06-06 12:16
memory.png (345 KB) | kraih, 2023-06-06 12:16
Updated by okurz over 1 year ago
- Copied from action #129487: high response times on osd - Limit the number of concurrent job upload handling on webUI side. Can we use a semaphore or lock using the database? size:M added
Updated by okurz over 1 year ago
- Copied to action #129493: high response times on osd - better nice level for velociraptor added
Updated by okurz over 1 year ago
- Related to action #130312: [tools] URL listing TW snapshots (and the changes therein), has stopped working added
Updated by okurz over 1 year ago
- Related to action #129619: high response times on osd - simple limit of jobs running concurrently in openQA size:M added
Updated by kraih over 1 year ago
- Assignee set to kraih
After the work on #129619, O3 now uses nginx in production. We've learned a lot about how to tune the configuration for openQA in the process. A similar config should work very well for OSD too. The SSL setup and Salt will be a slight complication, however.
Updated by livdywan over 1 year ago
- Status changed from New to In Progress
- Target version changed from future to Ready
So just for clarity, this is where the nginx work should be at, see #129619#note-19
Updated by kraih over 1 year ago
I will start preparing a config for OSD now, the same way as I did for O3. Nginx will be deployed in parallel to Apache2, running on ports 8080 and 4433. Once I'm confident that the config works, I will wait for a slow day and then do a switch where I stop Apache and run nginx in its place. Finally, once all issues have been resolved, Apache will be replaced with nginx in Salt (I will probably need help there, though). I don't expect this to go fast, since I want to be very careful about avoiding downtime.
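A parallel deployment as described could look roughly like the sketch below. The port numbers come from the comment above; the server_name and certificate paths are assumptions, since the actual OSD paths are not given here:

```nginx
# Sketch only: nginx on alternate ports while Apache keeps 80/443.
# server_name and ssl_certificate* paths are hypothetical.
server {
    listen 8080;
    server_name openqa.suse.de;
    # ... same proxy/location setup as the production config ...
}

server {
    listen 4433 ssl;
    server_name openqa.suse.de;
    ssl_certificate     /etc/ssl/servercerts/openqa.suse.de.crt;
    ssl_certificate_key /etc/ssl/private/openqa.suse.de.key;
    # ... same proxy/location setup as the production config ...
}
```

Once the parallel instance is validated, the listen directives can simply be swapped to 80/443 after stopping Apache.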
Updated by kraih over 1 year ago
Opened a PR with improvements for the default Nginx config from the O3 deployment: https://github.com/os-autoinst/openQA/pull/5185
Updated by okurz over 1 year ago
kraih wrote:
I will start preparing a config for OSD now, the same way as i did for O3. Nginx will be deployed in parallel to Apache2, running on ports 8080 and 4433. Once i'm confident that the config works, i will wait for a slow day, and then do a switch where i stop Apache and run Nginx in its place. And finally once all issues have been resolved, Apache will be replaced with Nginx in Salt (i will probably need help there though). I don't expect this to go fast, since i want to be very careful about avoiding downtime.
For OSD I suggest preparing changes in https://gitlab.suse.de/openqa/salt-states-openqa and not applying any manual changes. Regarding applying the actual changes, I suggest looking into https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/server.sls, searching for "apache", and applying the corresponding necessary changes for nginx. Having the nginx config files come from the openQA upstream repo should make that straightforward.
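As a rough illustration only, a Salt state swapping the two services might look like this. All state IDs, file names, and the template path here are hypothetical; the real server.sls in salt-states-openqa is structured differently:

```yaml
# Hypothetical sketch, not the actual salt-states-openqa content:
# install and run nginx, stop Apache, and manage the vhost config.
nginx:
  pkg.installed: []
  service.running:
    - enable: True
    - require:
      - pkg: nginx

apache2:
  service.dead:
    - enable: False

/etc/nginx/vhosts.d/openqa.conf:
  file.managed:
    - source: salt://openqa/nginx.conf.template
    - template: jinja
    - watch_in:
      - service: nginx
```

Keeping the vhost file under file.managed with watch_in means nginx reloads automatically whenever the template changes.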
Updated by okurz over 1 year ago
Can I blame the switch to nginx for https://github.com/os-autoinst/os-autoinst-distri-openQA/actions/runs/5178791698/jobs/9330745528?pr=121#step:5:240 reporting "502 Bad Gateway"? I doubt it's just a coincidence :)
Updated by okurz over 1 year ago
- Priority changed from Normal to Urgent
https://openqa.opensuse.org/tests/3338256 says "Reason: api failure: Connection error: Connection refused". I expect that many more openQA tests fail for the same or a related issue. How about we switch back to Apache for the time being to test more during normal working hours?
EDIT: Interestingly https://openqa.opensuse.org/tests?resultfilter=Incomplete only showed the one job https://openqa.opensuse.org/tests/3338256 failing for the same reason, currently no other.
Updated by openqa_review over 1 year ago
- Due date set to 2023-06-20
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan over 1 year ago
okurz wrote:
EDIT: Interestingly https://openqa.opensuse.org/tests?resultfilter=Incomplete only showed the one job https://openqa.opensuse.org/tests/3338256 failing for the same reason, currently no other.
Seems like that was a one-off. The same job is now looking fine and I also couldn't find other instances of the API failure.
Updated by kraih over 1 year ago
okurz wrote:
Can I blame the switch to nginx for https://github.com/os-autoinst/os-autoinst-distri-openQA/actions/runs/5178791698/jobs/9330745528?pr=121#step:5:240 reporting "502 Bad Gateway"? I doubt it's just a coincidence :)
Yes, the problem was that nginx can handle a whole lot more connections than Apache or our app server. It happily bombarded the app server with way too many requests during peak times. The newly adjusted config with an upstream max_conns limit appears to have taken care of the problem:
# The "max_conns" value should be identical to the maximum number of
# connections the webui is configured to handle concurrently
upstream webui {
    zone upstream_webui 64k;
    server 127.0.0.1:9526 max_conns=30;
}

upstream websocket {
    server 127.0.0.1:9527;
}

upstream livehandler {
    server 127.0.0.1:9528;
}

server {
    listen 80;
    server_name openqa.opensuse.org openqa.infra.opensuse.org;
    root /usr/share/openqa/public;

    client_max_body_size 0;
    client_body_buffer_size 64k;

    location /assets {
        alias /var/lib/openqa/share/factory;
        tcp_nopush on;
        sendfile on;
        sendfile_max_chunk 1m;
    }

    location /image {
        alias /var/lib/openqa/images;
        tcp_nopush on;
        sendfile on;
        sendfile_max_chunk 1m;
    }

    location /api/v1/ws/ {
        proxy_pass http://websocket;
        proxy_http_version 1.1;
        proxy_read_timeout 3600;
        proxy_send_timeout 3600;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location /liveviewhandler/ {
        proxy_pass http://livehandler;
        proxy_http_version 1.1;
        proxy_read_timeout 3600;
        proxy_send_timeout 3600;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location / {
        proxy_pass "http://webui";
        tcp_nodelay on;
        proxy_read_timeout 900;
        proxy_send_timeout 900;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Server $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto "https";
    }

    access_log /space/logs/nginx/openqa.access_log;
    error_log /space/logs/nginx/openqa.error_log;
}
Updated by kraih over 1 year ago
okurz wrote:
For OSD I suggest preparing changes in https://gitlab.suse.de/openqa/salt-states-openqa and not applying any manual changes. Regarding applying the actual changes, I suggest looking into https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/server.sls, searching for "apache", and applying the corresponding necessary changes for nginx. Having the nginx config files come from the openQA upstream repo should make that straightforward.
Not sure I can work it out that way. I can try to make the openQA upstream nginx config more reusable, but I'll probably not be able to figure out SSL without testing on the actual machine.
Updated by kraih over 1 year ago
- File processes.png processes.png added
- File memory.png memory.png added
Munin graphs show a visible reduction in memory usage since Nginx has been deployed on O3.
Updated by kraih over 1 year ago
Opened a PR to update the default nginx config again: https://github.com/os-autoinst/openQA/pull/5191
Updated by okurz over 1 year ago
- Related to action #130477: [O3]http connection to O3 repo is broken sporadically in virtualization tests, likely due to systemd dependencies on apache/nginx size:M added
Updated by kraih over 1 year ago
I've checked the O3 logs again today for unexpected 502 errors. There were two larger batches, one at 07:44 and one at 11:44, but each corresponded with a restart of openqa-webui, so nothing to worry about. Looks like everything is still working as expected.
Updated by kraih over 1 year ago
Today I've lowered the error_log level to warn, which revealed another setting that can be optimised:
2023/06/08 12:12:18 [warn] 19902#19902: *8368071 a client request body is buffered to a temporary file /var/lib/nginx/tmp//0000003363, client: 192.168.112.17, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345916/status HTTP/1.1", host: "openqa1-opensuse"
2023/06/08 12:12:19 [warn] 19902#19902: *8368086 a client request body is buffered to a temporary file /var/lib/nginx/tmp//0000003364, client: 192.168.112.11, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345811/artefact HTTP/1.1", host: "openqa1-opensuse"
2023/06/08 12:12:19 [warn] 19902#19902: *8368099 a client request body is buffered to a temporary file /var/lib/nginx/tmp//0000003365, client: 192.168.112.11, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345811/artefact HTTP/1.1", host: "openqa1-opensuse"
2023/06/08 12:12:19 [warn] 19902#19902: *8368121 a client request body is buffered to a temporary file /var/lib/nginx/tmp//0000003366, client: 192.168.112.11, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345811/artefact HTTP/1.1", host: "openqa1-opensuse"
2023/06/08 12:12:19 [warn] 20748#20748: *8368149 a client request body is buffered to a temporary file /var/lib/nginx/tmp//0000003367, client: 192.168.112.11, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345811/artefact HTTP/1.1", host: "openqa1-opensuse"
2023/06/08 12:12:20 [warn] 20748#20748: *8368205 a client request body is buffered to a temporary file /var/lib/nginx/tmp//0000003368, client: 192.168.112.11, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345811/artefact HTTP/1.1", host: "openqa1-opensuse"
2023/06/08 12:12:20 [warn] 20748#20748: *8368226 a client request body is buffered to a temporary file /var/lib/nginx/tmp//0000003369, client: 192.168.112.11, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345811/artefact HTTP/1.1", host: "openqa1-opensuse"
2023/06/08 12:13:15 [warn] 20748#20748: *8375999 a client request body is buffered to a temporary file /var/lib/nginx/tmp//0000003370, client: 192.168.112.17, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345855/artefact HTTP/1.1", host: "openqa1-opensuse"
2023/06/08 12:13:33 [warn] 20746#20746: *8378374 a client request body is buffered to a temporary file /var/lib/nginx/tmp//0000003371, client: 192.168.112.18, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345643/artefact HTTP/1.1", host: "openqa1-opensuse"
This happens with a very high frequency because we've only set client_body_buffer_size 64k;, which is too low to cover the 1 MB chunks of artefact uploads we are usually dealing with. As a solution I've now configured client_body_buffer_size 2m;. It won't completely eliminate buffering to files, since we still have larger uploads, but it should be enough to eliminate buffering for the vast majority of uploads.
Updated by kraih over 1 year ago
Also experimented with proxy_request_buffering off, which completely disables upload buffering to files in nginx. But that results in the app server getting overwhelmed in some cases:
2023/06/08 13:17:19 [error] 4625#4625: *309604 no live upstreams while connecting to upstream, client: 192.168.112.17, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345784/status HTTP/1.1", upstream: "http://webui/api/v1/jobs/3345784/status", host: "openqa1-opensuse"
192.168.112.17 - - [08/Jun/2023:13:17:19 +0000] "POST /api/v1/jobs/3345784/status HTTP/1.1" 502 157 "-" "Mojolicious (Perl)" rt=0.000 urt="0.000"
I find this quite interesting, because with buffering enabled we've not seen this before. That suggests the buffering does a pretty good job of protecting the app server, and we will be better off with client_body_buffer_size 2m; in production.
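For reference, the reverted experiment corresponds to a directive like the following inside the existing location / block (a sketch; the surrounding directives are abbreviated):

```nginx
location / {
    proxy_pass "http://webui";
    # Stream request bodies straight to the upstream instead of
    # buffering them in nginx first. Reverted: without nginx as a
    # buffer, bursts of uploads overwhelmed the app server.
    proxy_request_buffering off;
    # ... remaining proxy_set_header directives as before ...
}
```

With request buffering on (the default), nginx absorbs slow or bursty uploads and only forwards complete bodies, which is what keeps the 30-connection upstream limit workable.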
Updated by kraih over 1 year ago
Opened a PR for the buffer settings: https://github.com/os-autoinst/openQA/pull/5196
Updated by livdywan over 1 year ago
- Subject changed from high response times on osd - Try nginx with enabled load limiting or load balancing features to high response times on osd - Try nginx on o3 with enabled load limiting or load balancing features
- Description updated (diff)
Updated by livdywan over 1 year ago
- Copied to action #130636: high response times on osd - Try nginx on OSD size:S added
Updated by kraih over 1 year ago
- Status changed from In Progress to Feedback
With the OSD part of this ticket moved to #130636, I think we can transition to feedback. I have nothing else I want to experiment with. Here's the config we ended up with:
# The "max_conns" value should be identical to the maximum number of
# connections the webui is configured to handle concurrently
upstream webui {
    zone upstream_webui 64k;
    server 127.0.0.1:9526 max_conns=30;
}

upstream websocket {
    server 127.0.0.1:9527;
}

upstream livehandler {
    server 127.0.0.1:9528;
}

# Standard log format with added time
log_format with_time '$remote_addr - $remote_user [$time_local] '
                     '"$request" $status $body_bytes_sent '
                     '"$http_referer" "$http_user_agent" '
                     'rt=$request_time urt="$upstream_response_time"';

server {
    listen 80;
    server_name openqa.opensuse.org openqa.infra.opensuse.org;
    root /usr/share/openqa/public;

    client_max_body_size 0;

    # The "client_body_buffer_size" value should usually be larger
    # than the UPLOAD_CHUNK_SIZE used by openQA workers, so there is
    # no excessive buffering to disk
    client_body_buffer_size 2m;

    location /assets {
        alias /var/lib/openqa/share/factory;
        tcp_nopush on;
        sendfile on;
        sendfile_max_chunk 1m;
    }

    location /image {
        alias /var/lib/openqa/images;
        tcp_nopush on;
        sendfile on;
        sendfile_max_chunk 1m;
    }

    location /api/v1/ws/ {
        proxy_pass http://websocket;
        proxy_http_version 1.1;
        proxy_read_timeout 3600;
        proxy_send_timeout 3600;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location /liveviewhandler/ {
        proxy_pass http://livehandler;
        proxy_http_version 1.1;
        proxy_read_timeout 3600;
        proxy_send_timeout 3600;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location / {
        proxy_pass "http://webui";
        tcp_nodelay on;
        proxy_read_timeout 900;
        proxy_send_timeout 900;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Server $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto "https";
    }

    # O3 specific tmp paths
    client_body_temp_path /space/tmp/nginx/upload_temp;
    proxy_temp_path /space/tmp/nginx/proxy_temp;

    # O3 specific log paths
    access_log /space/logs/nginx/openqa.access_log with_time;
    error_log /space/logs/nginx/openqa.error_log;
}
Updated by kraih over 1 year ago
- Status changed from Feedback to Resolved
Looks like we have a pretty reliable setup on O3 now.
Updated by okurz over 1 year ago
- Related to action #132167: asset uploading failed with http status 502 size:M added