action #160985

open

nginx on osd returns 503 Service Temporarily Unavailable when accessing assets due to rate limiting

Added by jbaier_cz about 1 month ago. Updated 25 days ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version:
Start date: 2024-05-27
Due date:
% Done: 0%
Estimated time:

Description

Observation

After merging https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1182, which added rate limits to nginx, we see job failures in multiple places.

The nginx log:

2024/05/27 13:20:55 [error] 15939#15939: *654335 limiting requests, excess: 0.880 by zone "defaultlimit", client: 2a07:de40:b203:12:7ec2:55ff:fe24:de9c, server: openqa.suse.de, request: "GET /assets/iso/SLE-15-SP5-Online-ppc64le-Build139.11-Media1.iso HTTP/1.1", host: "openqa.suse.de"
2024/05/27 13:20:55 [error] 15939#15939: *654377 limiting requests, excess: 1.860 by zone "defaultlimit", client: 2a07:de40:b203:12:7ec2:55ff:fe24:de2a, server: openqa.suse.de, request: "GET /assets/hdd/sle-15-SP5-x86_64-Build:34037:freeradius-server-HA-BV.qcow2 HTTP/1.1", host: "openqa.suse.de"
2024/05/27 13:20:55 [error] 15939#15939: *654336 limiting requests, excess: 0.880 by zone "defaultlimit", client: 2a07:de40:b203:12:7ec2:55ff:fe24:de9c, server: openqa.suse.de, request: "GET /assets/hdd/openqa_support_server_sles12sp3.x86_64.qcow2 HTTP/1.1", host: "openqa.suse.de"
2024/05/27 13:20:55 [error] 15939#15939: *654433 limiting requests, excess: 0.850 by zone "defaultlimit", client: 10.100.96.76, server: openqa.suse.de, request: "GET /assets/hdd/SLED-15-SP4-:34041:xdg-desktop-portal-x86_64@64bit_for_regression.qcow2 HTTP/1.1", host: "openqa.suse.de"
2024/05/27 13:20:55 [error] 15939#15939: *654995 limiting requests, excess: 0.830 by zone "defaultlimit", client: 2a07:de40:b203:12:7ec2:55ff:fe24:de2a, server: openqa.suse.de, request: "GET /assets/hdd/sle-15-SP4-x86_64-Build:34037:freeradius-server-HA-BV.qcow2 HTTP/1.1", host: "openqa.suse.de"

Gitlab pipelines: https://gitlab.suse.de/openqa/scripts-ci/-/pipelines/1143088

Jobs: https://openqa.suse.de/tests/14443476

The merge request was reverted (hence normal priority for now) in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1197 until we find better settings for the limits.
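To see which clients are hit by the limit, the "limiting requests" entries in the nginx error log can be tallied per client. The following is only a minimal sketch: the function name count_limited_by_client is made up for illustration, and it assumes the error log line format shown in the excerpt above (", client: ADDR, server: ...").

```shell
# Hypothetical helper (not part of any existing tooling):
# read an nginx error log on stdin and print "COUNT - CLIENT" for every
# client that was rate limited, most frequently limited clients first.
count_limited_by_client() {
    # Keep only rate-limiting entries, then extract the client address,
    # i.e. the text between "client: " and the next comma.
    grep 'limiting requests' \
        | sed -n 's/.*client: \([^,]*\),.*/\1/p' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1 " - " $2 }'
}
```

It could be invoked as count_limited_by_client < /var/log/nginx/error.log, where the log path is an assumption and may differ on osd.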


Related issues 1 (0 open, 1 closed)

Has duplicate openQA Infrastructure - action #160982: [openQA] Failed to download SL-Micro 6.0 iso image Rejected 2024-05-27

Actions #1

Updated by nicksinger about 1 month ago

  • Has duplicate action #160982: [openQA] Failed to download SL-Micro 6.0 iso image added
Actions #2

Updated by tinita about 1 month ago

After the MR was reverted I did:

failed_since="2024-05-27 12:20Z" result="result='incomplete'" host=openqa.suse.de comment="label:poo160985" ./openqa-advanced-retrigger-jobs

and it did nothing. I added debugging and checked the SQL result manually, and there were no other incompletes besides https://openqa.suse.de/tests/14443476 which I restarted manually.

The script could be a bit more informative about what it is doing.
It also relies on ssh openqa.suse.de doing the right thing. See also #160889

Actions #3

Updated by ph03nix about 1 month ago · Edited

From today's access.log I extracted the busiest hosts, listed below. Only hosts with more than 1000 requests in total are shown. The counts are absolute numbers of requests over the whole access.log.

# Requests - Host
556914 - 2a07:de40:b203:12:7ec2:55ff:fe24:dc34
400445 - 2a07:de40:b203:12:7ec2:55ff:fe24:de9c
358383 - 2a07:de40:b203:12:7ec2:55ff:fe24:de2a
322646 - 2a07:de40:b203:12:7ec2:55ff:fe24:decc
316801 - 2a07:de40:b203:12:262:bff:fe31:b160
312817 - 2a07:de40:b203:12:262:bff:fe31:6430
298589 - 2a07:de40:b203:12:7ec2:55ff:fe24:dece
282749 - 2a07:de40:b203:12:3eec:efff:fefe:ac4
262131 - 2a07:de40:b203:12:7ec2:55ff:fe24:dde8
257918 - 2a07:de40:b203:12:7ec2:55ff:fe24:de70
147894 - 2a07:de40:a102:5:b645:6ff:fee9:4f51
100368 - 2a07:de40:a102:5:b645:6ff:fee9:48eb
98622 - 2a07:de40:a102:5:b645:6ff:fee9:49bd
92703 - 10.100.96.76
85252 - 10.100.96.74
84023 - 10.100.96.72
80610 - 10.144.53.194
74951 - 2a07:de40:a102:5:2e60:cff:fe73:3d6
66113 - 10.100.96.68
59274 - 10.144.53.193
49374 - 2a07:de40:a102:5:1e1b:dff:fe68:7ec7
42904 - 2a07:de40:b203:12:4048:e1ff:fed7:b5d0
36547 - 2a07:de40:a102:5:225:90ff:fea2:ba92
16362 - 2a07:de40:b203:12:0:ff:fe4f:7c2b
14833 - 2a07:de40:a102:5:42f2:e9ff:fe31:5460
11188 - 2a07:de40:a102:5:9abe:94ff:fe03:e94b
10984 - 10.67.18.153
10686 - 2a07:de40:b2bf:1b::104c
10346 - 10.145.10.33
8588 - 2a07:de40:a102:5:9abe:94ff:fe04:4817
7414 - 10.100.101.74
6756 - 10.100.209.142
6198 - 10.100.51.125
5649 - 10.100.206.26
5468 - 2a07:de40:b230:1:5054:ff:fe18:132b
5309 - 2a07:de40:b2bf:1b::1076
4454 - 10.149.213.14
4443 - 10.202.32.18
4416 - 10.202.32.47
3647 - 10.168.192.203
3621 - 10.145.10.3
3611 - 10.146.4.42
3254 - 2a07:de40:b230:1:ae1f:6bff:fe47:3ce
3235 - 2a07:de40:a102:5:ec4:7aff:fe6c:400a
3222 - 10.145.10.34
3097 - 2a07:de40:b2bf:1b::117b
2917 - 10.202.32.22
2252 - 2a07:de40:b2bf:1b::10c6
2248 - 10.145.10.5
2225 - 10.100.204.18
2192 - 10.202.48.10
2150 - 10.145.10.2
2143 - 10.145.10.13
1976 - 2a07:de40:b230:1:ae1f:6bff:fe01:130
1961 - 2a07:de40:a102:5:ae1f:6bff:fe47:687
1930 - 10.202.32.16
1881 - 10.100.206.82
1814 - 10.168.192.179
1783 - 10.168.192.91
1410 - 10.100.101.80
1403 - 2a07:de40:b2bf:1b::1117
1372 - 10.100.202.142
1361 - 2a07:de40:b204:4:10:145:54:28
1330 - 10.100.224.22
1257 - 2a07:de40:a102:5:225:90ff:fee5:530
1191 - 2a07:de40:b2bf:1b::1082
1080 - 10.145.10.7
1071 - 10.168.192.180
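
The exact commands used to produce this list are not given in this ticket. A minimal sketch of how such a per-host count could be computed is shown below. The function name requests_per_host is made up for illustration, and it assumes the client address is the first whitespace-separated field of each access.log line, as in nginx's default "combined" log format.

```shell
# Hypothetical helper: read an access log on stdin and print
# "COUNT - HOST" for every host with more than MIN requests
# (MIN is the first argument and defaults to 1000), busiest hosts first.
requests_per_host() {
    local min="${1:-1000}"
    # Field 1 is assumed to be the client address.
    awk '{ print $1 }' \
        | sort | uniq -c | sort -rn \
        | awk -v min="$min" '$1 > min { print $1 " - " $2 }'
}
```

For example: requests_per_host 1000 < /var/log/nginx/access.log, where the log path is an assumption.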
Actions #4

Updated by jbaier_cz about 1 month ago

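A little tweak: piping that list into the following loop resolves each address to its DNS name where a reverse record exists, and keeps the raw address otherwise:

while read -r; do n=$(cut -f1 -d' ' <<<"$REPLY"); h=$(cut -f3 -d' ' <<<"$REPLY"); dns="$(dig +short -x "$h" @dns1.suse.de)"; echo "$n - ${dns:-$h}"; done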

556914 - worker30.oqa.prg2.suse.org.
400445 - worker29.oqa.prg2.suse.org.
358383 - worker33.oqa.prg2.suse.org.
322646 - worker35.oqa.prg2.suse.org.
316801 - worker-arm1.oqa.prg2.suse.org.
312817 - worker-arm2.oqa.prg2.suse.org.
298589 - worker40.oqa.prg2.suse.org.
282749 - worker32.oqa.prg2.suse.org.
262131 - worker34.oqa.prg2.suse.org.
257918 - worker31.oqa.prg2.suse.org.
147894 - 2a07:de40:a102:5:b645:6ff:fee9:4f51
100368 - 2a07:de40:a102:5:b645:6ff:fee9:48eb
98622 - 2a07:de40:a102:5:b645:6ff:fee9:49bd
92703 - openqaworker18.qa.suse.cz.
85252 - openqaworker17.qa.suse.cz.
84023 - openqaworker16.qa.suse.cz.
80610 - gitlab-worker2.prg2.suse.org.
74951 - 2a07:de40:a102:5:2e60:cff:fe73:3d6
66113 - openqaworker14.qa.suse.cz.
59274 - gitlab-worker1.prg2.suse.org.
49374 - 2a07:de40:a102:5:1e1b:dff:fe68:7ec7
42904 - grenache-1.oqa.prg2.suse.org.
36547 - 2a07:de40:a102:5:225:90ff:fea2:ba92
16362 - openqa.oqa.prg2.suse.org.
14833 - 2a07:de40:a102:5:42f2:e9ff:fe31:5460
11188 - 2a07:de40:a102:5:9abe:94ff:fe03:e94b
10984 - 10.67.18.153
10686 - 2a07:de40:b2bf:1b::104c
10346 - worker-arm1.oqa.prg2.suse.org.
8588 - 2a07:de40:a102:5:9abe:94ff:fe04:4817
7414 - qesapworker-prg4.qa.suse.cz.
6756 - 10.100.209.142
6198 - dhcp125.suse.cz.
5649 - 10.100.206.26
5468 - 2a07:de40:b230:1:5054:ff:fe18:132b
5309 - 2a07:de40:b2bf:1b::1076
4454 - 10.149.213.14
4443 - 10.202.32.18
4416 - 10.202.32.47
3647 - openqaipmi5.qe.nue2.suse.org.
3621 - worker30.oqa.prg2.suse.org.
3611 - fozzie.qe.prg2.suse.org.
3254 - fozzie.qe.prg2.suse.org.
3235 - 2a07:de40:a102:5:ec4:7aff:fe6c:400a
3222 - worker-arm2.oqa.prg2.suse.org.
3097 - 2a07:de40:b2bf:1b::117b
2917 - 10.202.32.22
2252 - 2a07:de40:b2bf:1b::10c6
2248 - worker32.oqa.prg2.suse.org.
2225 - 10.100.204.18
2192 - 10.202.48.10
2150 - worker29.oqa.prg2.suse.org.
2143 - worker40.oqa.prg2.suse.org.
1976 - quinn.qe.prg2.suse.org.
1961 - 2a07:de40:a102:5:ae1f:6bff:fe47:687
1930 - 10.202.32.16
1881 - 10.100.206.82
1814 - sapworker1.qe.nue2.suse.org.
1783 - gonzo-1.qe.nue2.suse.org.
1410 - qesapworker-prg7.qa.suse.cz.
1403 - 2a07:de40:b2bf:1b::1117
1372 - 10.100.202.142
1361 - ibs-botmaster.ibs-inbound-s.buildops.prg2.suse.org.
1330 - vpelcak.udp.ovpn1.prg.suse.de.
1257 - 2a07:de40:a102:5:225:90ff:fee5:530
1191 - 2a07:de40:b2bf:1b::1082
1080 - worker34.oqa.prg2.suse.org.
1071 - sapworker2.qe.nue2.suse.org.
Actions #5

Updated by ph03nix about 1 month ago · Edited

And this is another "naughty" list: the maximum number of requests per second per host.

So, e.g.

593 - 2a07:de40:b2bf:1b::1076

This means that host 2a07:de40:b2bf:1b::1076 was doing at most 593 requests per second in today's access.log.

593 - 2a07:de40:b2bf:1b::1076
484 - 2a07:de40:b2bf:1b::104c
469 - 2a07:de40:b203:12:7ec2:55ff:fe24:dc34
446 - 10.100.202.142
426 - 10.100.224.22
405 - 10.100.206.26
372 - 10.100.206.82
354 - 2a07:de40:b203:12:262:bff:fe31:6430
314 - 10.67.18.153
311 - 10.100.51.125
306 - 10.100.227.218
295 - 10.144.53.193
283 - 2a07:de40:b203:12:262:bff:fe31:b160
283 - 2a07:de40:b2bf:1b::10c6
278 - 2a07:de40:b2bf:1b::1071
273 - 10.202.32.25
250 - 2a07:de40:b203:12:7ec2:55ff:fe24:de2a
233 - 10.144.53.194
231 - 2a07:de40:b203:12:7ec2:55ff:fe24:dde8
226 - 2a07:de40:b203:12:7ec2:55ff:fe24:de9c
225 - 10.100.230.222
217 - 2a07:de40:b203:12:7ec2:55ff:fe24:decc
215 - 2a07:de40:b203:12:3eec:efff:fefe:ac4
215 - 10.100.225.186
210 - 2a07:de40:b203:12:7ec2:55ff:fe24:dece
210 - 2a07:de40:b203:12:7ec2:55ff:fe24:de70
200 - 2a07:de40:b2bf:1b::1117
193 - 2a07:de40:b2bf:1b::1106
190 - 10.100.209.142
182 - 2a07:de40:b2bf:1b::1082
181 - 10.202.0.14
179 - 10.84.228.18
161 - 10.100.96.74
157 - 10.100.96.76
139 - 10.100.203.174
137 - 10.149.212.2
136 - 10.149.213.14
133 - 10.100.96.72
126 - 10.149.211.230
115 - 2a07:de40:a102:5:b645:6ff:fee9:4f51
114 - 2a07:de40:a102:5:b645:6ff:fee9:49bd
112 - 10.100.96.68
110 - 2a07:de40:a102:5:2e60:cff:fe73:3d6
109 - 10.100.204.18
109 - 10.202.0.20
108 - 10.100.200.146
102 - 2a07:de40:b2bf:1b::117b
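
How these peak values were computed is not stated in this ticket. One possible approach, as a sketch only: count requests per host and per one-second timestamp, then keep the largest count for each host. The function name max_rps_per_host is made up for illustration, and the sketch assumes nginx's default "combined" log format, where field 1 is the client address and field 4 is the timestamp with one-second resolution (e.g. [27/May/2024:13:20:55).

```shell
# Hypothetical helper: read an access log on stdin and print
# "MAX - HOST", where MAX is the highest number of requests that host
# made within any single second, highest peaks first.
max_rps_per_host() {
    awk '
        # Count requests per (host, second) pair.
        { count[$1 " " $4]++ }
        END {
            # Reduce to the per-host maximum over all seconds.
            for (key in count) {
                split(key, part, " ")
                host = part[1]
                if (count[key] > max[host]) max[host] = count[key]
            }
            for (host in max) print max[host] " - " host
        }
    ' | sort -rn
}
```

Note that this measures peaks per calendar second, so a burst that straddles two seconds is split across them and appears smaller than it was.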
Actions #6

Updated by ph03nix about 1 month ago

And with DNS names:

593 - 2a07:de40:b2bf:1b::1076
484 - 2a07:de40:b2bf:1b::104c
469 - worker30.oqa.prg2.suse.org.
446 - 10.100.202.142
426 - vpelcak.udp.ovpn1.prg.suse.de.
405 - 10.100.206.26
372 - 10.100.206.82
354 - worker-arm2.oqa.prg2.suse.org.
314 - 10.67.18.153
311 - dhcp125.suse.cz.
306 - 10.100.227.218
295 - gitlab-worker1.prg2.suse.org.
283 - worker-arm1.oqa.prg2.suse.org.
283 - 2a07:de40:b2bf:1b::10c6
278 - 2a07:de40:b2bf:1b::1071
273 - 10.202.32.25
250 - worker33.oqa.prg2.suse.org.
233 - gitlab-worker2.prg2.suse.org.
231 - worker34.oqa.prg2.suse.org.
226 - worker29.oqa.prg2.suse.org.
225 - 10.100.230.222
217 - worker35.oqa.prg2.suse.org.
215 - worker32.oqa.prg2.suse.org.
215 - 10.100.225.186
210 - worker40.oqa.prg2.suse.org.
210 - worker31.oqa.prg2.suse.org.
200 - 2a07:de40:b2bf:1b::1117
193 - 2a07:de40:b2bf:1b::1106
190 - 10.100.209.142
182 - 2a07:de40:b2bf:1b::1082
181 - 10.202.0.14
179 - 10.84.228.18
161 - openqaworker17.qa.suse.cz.
157 - openqaworker18.qa.suse.cz.
139 - 10.100.203.174
137 - 10.149.212.2
136 - 10.149.213.14
133 - openqaworker16.qa.suse.cz.
126 - 10.149.211.230
115 - 2a07:de40:a102:5:b645:6ff:fee9:4f51
114 - 2a07:de40:a102:5:b645:6ff:fee9:49bd
112 - openqaworker14.qa.suse.cz.
110 - 2a07:de40:a102:5:2e60:cff:fe73:3d6
109 - 10.100.204.18
109 - 10.202.0.20
108 - mloviska.udp.ovpn2.prg.suse.de.
Actions #7

Updated by ph03nix about 1 month ago

So we have two classes of consumers that produce a high load:

  • Workers that impose a lot of requests overall
  • Some workloads that cause high local peaks; these are not workers

I think this suggests that imposing rate limits on the API might have larger side effects than we anticipate, which makes this a complex task that is likely not worth pursuing further. I'll leave it here for now.

Actions #8

Updated by okurz 27 days ago

  • Target version changed from Ready to Tools - Next
Actions #9

Updated by okurz 25 days ago

  • Target version changed from Tools - Next to future