action #129490
closed coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
coordination #108209: [epic] Reduce load on OSD
high response times on osd - Try nginx on o3 with enabled load limiting or load balancing features
Added by okurz over 1 year ago. Updated over 1 year ago.
Description
Files
processes.png (172 KB) | kraih, 2023-06-06 12:16
memory.png (345 KB) | kraih, 2023-06-06 12:16
Updated by okurz over 1 year ago
- Copied from action #129487: high response times on osd - Limit the number of concurrent job upload handling on webUI side. Can we use a semaphore or lock using the database? size:M added
Updated by okurz over 1 year ago
- Copied to action #129493: high response times on osd - better nice level for velociraptor added
Updated by okurz over 1 year ago
- Related to action #130312: [tools] URL listing TW snapshots (and the changes therein), has stopped working added
Updated by okurz over 1 year ago
- Related to action #129619: high response times on osd - simple limit of jobs running concurrently in openQA size:M added
Updated by kraih over 1 year ago
- Assignee set to kraih
After the work on #129619, O3 now uses nginx in production. We've learned a lot about how to tune the configuration for openQA in the process. A similar config should work very well for OSD too. The SSL setup and Salt will be a slight complication, however.
Updated by livdywan over 1 year ago
- Status changed from New to In Progress
- Target version changed from future to Ready
So just for clarity, this is where the nginx work should be at, see #129619#note-19
Updated by kraih over 1 year ago
I will start preparing a config for OSD now, the same way as I did for O3. Nginx will be deployed in parallel to Apache2, running on ports 8080 and 4433. Once I'm confident that the config works, I will wait for a slow day and then do a switch where I stop Apache and run nginx in its place. Finally, once all issues have been resolved, Apache will be replaced with nginx in Salt (I will probably need help there, though). I don't expect this to go fast, since I want to be very careful about avoiding downtime.
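A parallel deployment as described could look roughly like the sketch below. The port numbers come from the comment above; the server_name and certificate paths are assumptions, since the actual OSD paths are not given here:

```nginx
# Sketch only: nginx on alternate ports while Apache keeps 80/443.
# server_name and ssl_certificate* paths are hypothetical.
server {
    listen 8080;
    server_name openqa.suse.de;
    # ... same proxy/location setup as the production config ...
}

server {
    listen 4433 ssl;
    server_name openqa.suse.de;
    ssl_certificate     /etc/ssl/servercerts/openqa.suse.de.crt;
    ssl_certificate_key /etc/ssl/private/openqa.suse.de.key;
    # ... same proxy/location setup as the production config ...
}
```

Once the parallel instance is validated, the listen directives can simply be swapped to 80/443 after stopping Apache.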
Updated by kraih over 1 year ago
Opened a PR with improvements for the default Nginx config from the O3 deployment: https://github.com/os-autoinst/openQA/pull/5185
Updated by okurz over 1 year ago
kraih wrote:
I will start preparing a config for OSD now, the same way as i did for O3. Nginx will be deployed in parallel to Apache2, running on ports 8080 and 4433. Once i'm confident that the config works, i will wait for a slow day, and then do a switch where i stop Apache and run Nginx in its place. And finally once all issues have been resolved, Apache will be replaced with Nginx in Salt (i will probably need help there though). I don't expect this to go fast, since i want to be very careful about avoiding downtime.
For OSD I suggest preparing changes in https://gitlab.suse.de/openqa/salt-states-openqa and not applying any manual changes. Regarding applying the actual changes, I suggest looking into https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/server.sls, searching for "apache", and applying the corresponding necessary changes for nginx. Having the nginx config files come from the openQA upstream repo should make that straightforward.
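As a rough illustration only, a Salt state swapping the two services might look like this. All state IDs, file names, and the template path here are hypothetical; the real server.sls in salt-states-openqa is structured differently:

```yaml
# Hypothetical sketch, not the actual salt-states-openqa content:
# install and run nginx, stop Apache, and manage the vhost config.
nginx:
  pkg.installed: []
  service.running:
    - enable: True
    - require:
      - pkg: nginx

apache2:
  service.dead:
    - enable: False

/etc/nginx/vhosts.d/openqa.conf:
  file.managed:
    - source: salt://openqa/nginx.conf.template
    - template: jinja
    - watch_in:
      - service: nginx
```

Keeping the vhost file under file.managed with watch_in means nginx reloads automatically whenever the template changes.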
Updated by okurz over 1 year ago
Can I blame the switch to nginx for https://github.com/os-autoinst/os-autoinst-distri-openQA/actions/runs/5178791698/jobs/9330745528?pr=121#step:5:240 reporting "502 Bad Gateway"? I doubt it's just a coincidence :)
Updated by okurz over 1 year ago
- Priority changed from Normal to Urgent
https://openqa.opensuse.org/tests/3338256 says "Reason: api failure: Connection error: Connection refused". I expect that many more openQA tests fail for the same or a related issue. How about we switch back to Apache for the time being to test more during normal working hours?
EDIT: Interestingly https://openqa.opensuse.org/tests?resultfilter=Incomplete only showed the one job https://openqa.opensuse.org/tests/3338256 failing for the same reason, currently no other.
Updated by openqa_review over 1 year ago
- Due date set to 2023-06-20
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan over 1 year ago
okurz wrote:
EDIT: Interestingly https://openqa.opensuse.org/tests?resultfilter=Incomplete only showed the one job https://openqa.opensuse.org/tests/3338256 failing for the same reason, currently no other.
Seems like that was a one-off. The same job is now looking fine and I also couldn't find other instances of the API failure.
Updated by kraih over 1 year ago
okurz wrote:
Can I blame the switch to nginx for https://github.com/os-autoinst/os-autoinst-distri-openQA/actions/runs/5178791698/jobs/9330745528?pr=121#step:5:240 reporting "502 Bad Gateway"? I doubt it's just a coincidence :)
Yes, the problem was that nginx can handle a whole lot more connections than Apache or our app server. It happily bombarded the app server with way too many requests during peak times. The newly adjusted config with an upstream max_conns limit appears to have taken care of the problem:
# The "max_conns" value should be identical to the maximum number of
# connections the webui is configured to handle concurrently
upstream webui {
    zone upstream_webui 64k;
    server 127.0.0.1:9526 max_conns=30;
}

upstream websocket {
    server 127.0.0.1:9527;
}

upstream livehandler {
    server 127.0.0.1:9528;
}

server {
    listen 80;
    server_name openqa.opensuse.org openqa.infra.opensuse.org;
    root /usr/share/openqa/public;

    client_max_body_size 0;
    client_body_buffer_size 64k;

    location /assets {
        alias /var/lib/openqa/share/factory;
        tcp_nopush on;
        sendfile on;
        sendfile_max_chunk 1m;
    }

    location /image {
        alias /var/lib/openqa/images;
        tcp_nopush on;
        sendfile on;
        sendfile_max_chunk 1m;
    }

    location /api/v1/ws/ {
        proxy_pass http://websocket;
        proxy_http_version 1.1;
        proxy_read_timeout 3600;
        proxy_send_timeout 3600;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location /liveviewhandler/ {
        proxy_pass http://livehandler;
        proxy_http_version 1.1;
        proxy_read_timeout 3600;
        proxy_send_timeout 3600;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location / {
        proxy_pass "http://webui";
        tcp_nodelay on;
        proxy_read_timeout 900;
        proxy_send_timeout 900;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Server $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto "https";
    }

    access_log /space/logs/nginx/openqa.access_log;
    error_log /space/logs/nginx/openqa.error_log;
}
Updated by kraih over 1 year ago
okurz wrote:
For OSD I suggest preparing changes in https://gitlab.suse.de/openqa/salt-states-openqa and not applying any manual changes. Regarding applying the actual changes, I suggest looking into https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/server.sls, searching for "apache", and applying the corresponding necessary changes for nginx. Having the nginx config files come from the openQA upstream repo should make that straightforward.
Not sure I can work it out that way. I can try to make the openQA upstream nginx config more reusable, but I'll probably not be able to figure out SSL without testing on the actual machine.
Updated by kraih over 1 year ago
- File processes.png processes.png added
- File memory.png memory.png added
Munin graphs show a visible reduction in memory usage since Nginx has been deployed on O3.
Updated by kraih over 1 year ago
Opened a PR to update the default nginx config again: https://github.com/os-autoinst/openQA/pull/5191
Updated by okurz over 1 year ago
- Related to action #130477: [O3]http connection to O3 repo is broken sporadically in virtualization tests, likely due to systemd dependencies on apache/nginx size:M added
Updated by kraih over 1 year ago
I've checked the O3 logs again today for unexpected 502 errors. There were two larger batches, one at 07:44 and one at 11:44, but each corresponded with a restart of openqa-webui, so nothing to worry about. Looks like everything is still working as expected.
Updated by kraih over 1 year ago
Today I've lowered the error_log level to warn, which revealed another setting that can be optimised:
2023/06/08 12:12:18 [warn] 19902#19902: *8368071 a client request body is buffered to a temporary file /var/lib/nginx/tmp//0000003363, client: 192.168.112.17, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345916/status HTTP/1.1", host: "openqa1-opensuse"
2023/06/08 12:12:19 [warn] 19902#19902: *8368086 a client request body is buffered to a temporary file /var/lib/nginx/tmp//0000003364, client: 192.168.112.11, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345811/artefact HTTP/1.1", host: "openqa1-opensuse"
2023/06/08 12:12:19 [warn] 19902#19902: *8368099 a client request body is buffered to a temporary file /var/lib/nginx/tmp//0000003365, client: 192.168.112.11, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345811/artefact HTTP/1.1", host: "openqa1-opensuse"
2023/06/08 12:12:19 [warn] 19902#19902: *8368121 a client request body is buffered to a temporary file /var/lib/nginx/tmp//0000003366, client: 192.168.112.11, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345811/artefact HTTP/1.1", host: "openqa1-opensuse"
2023/06/08 12:12:19 [warn] 20748#20748: *8368149 a client request body is buffered to a temporary file /var/lib/nginx/tmp//0000003367, client: 192.168.112.11, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345811/artefact HTTP/1.1", host: "openqa1-opensuse"
2023/06/08 12:12:20 [warn] 20748#20748: *8368205 a client request body is buffered to a temporary file /var/lib/nginx/tmp//0000003368, client: 192.168.112.11, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345811/artefact HTTP/1.1", host: "openqa1-opensuse"
2023/06/08 12:12:20 [warn] 20748#20748: *8368226 a client request body is buffered to a temporary file /var/lib/nginx/tmp//0000003369, client: 192.168.112.11, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345811/artefact HTTP/1.1", host: "openqa1-opensuse"
2023/06/08 12:13:15 [warn] 20748#20748: *8375999 a client request body is buffered to a temporary file /var/lib/nginx/tmp//0000003370, client: 192.168.112.17, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345855/artefact HTTP/1.1", host: "openqa1-opensuse"
2023/06/08 12:13:33 [warn] 20746#20746: *8378374 a client request body is buffered to a temporary file /var/lib/nginx/tmp//0000003371, client: 192.168.112.18, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345643/artefact HTTP/1.1", host: "openqa1-opensuse"
This happens with a very high frequency because we've only set client_body_buffer_size 64k;, which is too low to cover the 1 MB chunks of artefact uploads we are usually dealing with. As a solution I've now configured client_body_buffer_size 2m;. It won't completely eliminate buffering to files, since we still have larger uploads, but it should be enough to eliminate buffering for the vast majority of uploads.
Updated by kraih over 1 year ago
Also experimented with proxy_request_buffering off, which completely disables upload buffering to files in nginx. But that results in the app server getting overwhelmed in some cases:
2023/06/08 13:17:19 [error] 4625#4625: *309604 no live upstreams while connecting to upstream, client: 192.168.112.17, server: openqa.opensuse.org, request: "POST /api/v1/jobs/3345784/status HTTP/1.1", upstream: "http://webui/api/v1/jobs/3345784/status", host: "openqa1-opensuse"
192.168.112.17 - - [08/Jun/2023:13:17:19 +0000] "POST /api/v1/jobs/3345784/status HTTP/1.1" 502 157 "-" "Mojolicious (Perl)" rt=0.000 urt="0.000"
I find this quite interesting, because with buffering enabled we've not seen this before. That suggests the buffering does a pretty good job of protecting the app server, and we will be better off with client_body_buffer_size 2m; in production.
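For reference, the reverted experiment corresponds to a directive like the following inside the existing location / block (a sketch; the surrounding directives are abbreviated):

```nginx
location / {
    proxy_pass "http://webui";
    # Stream request bodies straight to the upstream instead of
    # buffering them in nginx first. Reverted: without nginx as a
    # buffer, bursts of uploads overwhelmed the app server.
    proxy_request_buffering off;
    # ... remaining proxy_set_header directives as before ...
}
```

With request buffering on (the default), nginx absorbs slow or bursty uploads and only forwards complete bodies, which is what keeps the 30-connection upstream limit workable.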
Updated by kraih over 1 year ago
Opened a PR for the buffer settings: https://github.com/os-autoinst/openQA/pull/5196
Updated by livdywan over 1 year ago
- Subject changed from high response times on osd - Try nginx with enabled load limiting or load balancing features to high response times on osd - Try nginx on o3 with enabled load limiting or load balancing features
- Description updated (diff)
Updated by livdywan over 1 year ago
- Copied to action #130636: high response times on osd - Try nginx on OSD size:S added
Updated by kraih over 1 year ago
- Status changed from In Progress to Feedback
With the OSD part of this ticket moved to #130636, I think we can transition to feedback. I have nothing else I want to experiment with. Here's the config we ended up with:
# The "max_conns" value should be identical to the maximum number of
# connections the webui is configured to handle concurrently
upstream webui {
    zone upstream_webui 64k;
    server 127.0.0.1:9526 max_conns=30;
}

upstream websocket {
    server 127.0.0.1:9527;
}

upstream livehandler {
    server 127.0.0.1:9528;
}

# Standard log format with added time
log_format with_time '$remote_addr - $remote_user [$time_local] '
                     '"$request" $status $body_bytes_sent '
                     '"$http_referer" "$http_user_agent" '
                     'rt=$request_time urt="$upstream_response_time"';

server {
    listen 80;
    server_name openqa.opensuse.org openqa.infra.opensuse.org;
    root /usr/share/openqa/public;

    client_max_body_size 0;

    # The "client_body_buffer_size" value should usually be larger
    # than the UPLOAD_CHUNK_SIZE used by openQA workers, so there is
    # no excessive buffering to disk
    client_body_buffer_size 2m;

    location /assets {
        alias /var/lib/openqa/share/factory;
        tcp_nopush on;
        sendfile on;
        sendfile_max_chunk 1m;
    }

    location /image {
        alias /var/lib/openqa/images;
        tcp_nopush on;
        sendfile on;
        sendfile_max_chunk 1m;
    }

    location /api/v1/ws/ {
        proxy_pass http://websocket;
        proxy_http_version 1.1;
        proxy_read_timeout 3600;
        proxy_send_timeout 3600;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location /liveviewhandler/ {
        proxy_pass http://livehandler;
        proxy_http_version 1.1;
        proxy_read_timeout 3600;
        proxy_send_timeout 3600;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location / {
        proxy_pass "http://webui";
        tcp_nodelay on;
        proxy_read_timeout 900;
        proxy_send_timeout 900;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Server $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto "https";
    }

    # O3 specific tmp paths
    client_body_temp_path /space/tmp/nginx/upload_temp;
    proxy_temp_path /space/tmp/nginx/proxy_temp;

    # O3 specific log paths
    access_log /space/logs/nginx/openqa.access_log with_time;
    error_log /space/logs/nginx/openqa.error_log;
}
Updated by kraih over 1 year ago
- Status changed from Feedback to Resolved
Looks like we have a pretty reliable setup on O3 now.
Updated by okurz over 1 year ago
- Related to action #132167: asset uploading failed with http status 502 size:M added