action #160239
closed
openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
openQA Project (public) - coordination #108209: [epic] Reduce load on OSD
[alert] External http responses Salt (https://openqa.suse.de/health) due to "Too many open files" after switch to nginx
Description
Observation
1 firing alert instance
[IMAGE]
📁 SALT › EXTERNAL HTTP RESPONSES
🔥 1 firing instances
Firing [stats.openqa-monitor.qa.suse.de]
http://stats.openqa-monitor.qa.suse.de/alerting/grafana/b3a53df8-b7ee-48dd-9325-8a541187737f/view?orgId=1
External http responses
View alert [stats.openqa-monitor.qa.suse.de]
Summary
HTTP endpoint does not properly work
Description
An HTTP endpoint we need for proper operation delivers an http status code which indicates an issue with the service or its reachability.
Values
B=500 C=1
Labels
alertname
External http responses
grafana_folder
Salt
server
https://openqa.suse.de/health
Looking into the access log, we had 4825 HTTP 500 server errors today so far, not only for https://openqa.suse.de/health.
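For reference, such a count could be obtained roughly like this (a sketch assuming the default combined access log format, where the status code is the ninth field, and the standard log path):
# count HTTP 500 responses in the current access log (path and log format are assumptions)
awk '$9 == 500' /var/log/nginx/access.log | wc -l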
The error log shows many entries like:
2024/05/12 00:06:06 [crit] 2563#2563: accept4() failed (24: Too many open files)
The first occurrence I can find was 2024/05/07 12:02:50.
For comparison, the number of open files:
# o3
lsof | wc -l
18978
# osd
lsof | wc -l
35675
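To put those numbers into perspective, the system-wide file descriptor usage and limits can be checked as well (a sketch; not measured output from OSD):
# allocated fds, free fds, and the system-wide maximum
cat /proc/sys/fs/file-nr
# soft and hard per-process limits of the current shell
ulimit -Sn; ulimit -Hn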
Rollback actions
- DONE Remove silence from https://stats.openqa-monitor.qa.suse.de/alerting/silences
alertname=External http responses server=https://openqa.suse.de/health
Updated by okurz 7 months ago
I suggest we look into solutions from
https://serverfault.com/a/1003630
Updated by okurz 7 months ago
- Related to action #130636: high response times on osd - Try nginx on OSD size:S added
Updated by okurz 7 months ago · Edited
- Subject changed from [alert] External http responses Salt (https://openqa.suse.de/health) / Too many open files to [alert] External http responses Salt (https://openqa.suse.de/health) due to "Too many open files" after switch to nginx
On OSD
systemctl show nginx | grep LimitNOFILE
shows LimitNOFILE=524288
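Worth noting (my assumption, not verified on OSD): systemctl shows the hard limit here, while the soft limit is what accept4() actually runs into, and on current systemd it often defaults to a much lower 1024. On a sufficiently new systemd both can be queried explicitly:
# show the hard and soft file descriptor limits configured for the unit (property names assume a recent systemd)
systemctl show nginx -p LimitNOFILE -p LimitNOFILESoft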
I called
for i in $(pgrep -f nginx); do grep 'Max open files' /proc/$i/limits; done
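Each matching line has the columns soft limit, hard limit, and units; the soft limit is the one that matters for the accept4() failures. The numbers below are only illustrative, not actual OSD output:
Max open files            1024                 524288               files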
For now, on OSD, I added to /etc/nginx/nginx.conf
worker_rlimit_nofile 4096;
as suggested in https://stackoverflow.com/a/58869000/22022448
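After such an edit, the usual validate-and-reload sequence applies so the running workers pick up the new limit (standard nginx/systemd commands, shown as a sketch):
# check the syntax first, then reload without dropping connections
nginx -t && systemctl reload nginx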
And called
host=openqa.suse.de failed_since=2024-05-09 comment=label:160239 ./openqa-advanced-retrigger-jobs
zgrep 'Too many open files' error.log*
reveals mentions of the error message going back to 2024-05-08.
So I actually extended the above command to
host=openqa.suse.de failed_since=2024-05-07 comment=label:160239 ./openqa-advanced-retrigger-jobs
Updated by okurz 7 months ago
- Tags set to infra, salt, alert, nginx
- Status changed from In Progress to New
- Priority changed from Urgent to Normal
Monitoring
sudo tail -n 200 -f /var/log/nginx/error.log | grep 'Too many open files'
shows no more problems, so I am reducing the priority. I plan to research what a good value would be and make it persistent in salt.
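As a quick check for how often the error still shows up across the rotated logs, something like this should do (a sketch; zgrep also handles the uncompressed current log):
zgrep -c 'Too many open files' /var/log/nginx/error.log*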
Updated by okurz 7 months ago
Starting with
2024/05/13 16:28:37 [crit] 22755#22755: accept4() failed (24: Too many open files)
there are again many more "Too many open files" occurrences. Apparently the problem is that my setting disappeared.
Created
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1187
and
host=openqa.suse.de failed_since="2024-05-13 16:00" comment=label:160239 ./openqa-advanced-retrigger-jobs
export host=openqa.suse.de; export failed_since="'2024-05-13 16:00'"; for i in $(./openqa-monitor-investigation-candidates | sed 's@^[^0-9]*\([0-9]*\) .*$@\1@g'); do openqa-cli api --osd -X post jobs/$i/duplicate ; done
Updated by okurz 7 months ago · Edited
- Description updated (diff)
- Status changed from New to In Progress
The config was invalid. Created and merged https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1188, and added a silence and a rollback action. Regarding the choice of the variable value I am following https://www.cloudbees.com/blog/tuning-nginx and running a benchmark with ab -c 40 -n 50000 https://openqa.suse.de/
from my notebook over VPN. Baseline with
worker_processes 4;
worker_rlimit_nofile 4096;
events {
    worker_connections 1024;
    use epoll;
}
results:
Concurrency Level: 40
Time taken for tests: 593.604 seconds
Complete requests: 50000
Failed requests: 0
Total transferred: 3541550000 bytes
HTML transferred: 3452300000 bytes
Requests per second: 84.23 [#/sec] (mean)
Time per request: 474.883 [ms] (mean)
Time per request: 11.872 [ms] (mean, across all concurrent requests)
Transfer rate: 5826.35 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       73  153   70.7    149   1346
Processing:   176  321   70.7    312   1231
Waiting:      135  235   70.6    211    926
Total:        272  474   89.1    458   1821
Percentage of the requests served within a certain time (ms)
50% 458
66% 478
75% 495
80% 509
90% 551
95% 593
98% 669
99% 772
100% 1821 (longest request)
I guess the most important figure here is
Requests per second: 84.23 [#/sec] (mean)
With worker_processes 8
I immediately get error messages from ab like
SSL read failed (1) - closing connection
140336139147136:error:140E0197:SSL routines:SSL_shutdown:shutdown while in init:ssl/ssl_lib.c:2094:
so I guess 8 worker processes is too much, and ab helps nicely to check the settings. With worker_processes 2;
I also get "SSL handshake failed (5)." and the like.
But now, with worker_connections 1024;
we are running into a lot of errors about "1024 worker_connections are not enough". No idea why that should suddenly be necessary; I could not find any reference to that error string in the log archives of the past days.
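My reading of the nginx documentation (an assumption about our workload, not something measured on OSD): each worker handles at most worker_connections connections, and when proxying each connection can use two file descriptors (client plus upstream), so worker_rlimit_nofile should be at least about twice worker_connections, while the total of concurrent clients is roughly bounded by worker_processes * worker_connections:
# rough capacity estimate with the values chosen below (illustrative arithmetic only)
# max clients    ~ worker_processes * worker_connections = 4 * 4096 = 16384
# fds per worker ~ 2 * worker_connections = 2 * 4096 = 8192 <= worker_rlimit_nofile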
I changed that config now to
worker_processes 4;
worker_rlimit_nofile 40960; # needs to be sufficiently higher than worker_connections but below "ulimit -Hn", which is likely 524288 on current systems
events {
    worker_connections 4096;
    use epoll;
}
No more error messages with that. The benchmark reveals 83.79 [#/sec] (mean), the same as before. Stopped salt-minion on OSD to prevent the config from being overwritten again.
Created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1189. After the deploy I can enable salt-minion again and make /etc/nginx/nginx.conf writable again.
Deployed, salt-minion enabled again.
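To double-check on the running system that the workers actually got the higher limit after the deploy (a sketch; the pgrep pattern is an assumption):
for i in $(pgrep -f 'nginx: worker'); do grep 'Max open files' /proc/$i/limits; done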
worker_processes 4, worker_rlimit_nofile 8192, worker_connections 4096 yields 80.36 #/sec and no errors.
Trying
worker_processes 4, worker_rlimit_nofile 16384, worker_connections 8192 yields 83.18 #/sec and no errors. So no significant change at all; we can stay with 8192/4096. Putting that into salt:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1191
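For reference, the resulting nginx.conf excerpt with the values settled on above, assuming the same structure as the baseline shown earlier:
worker_processes 4;
worker_rlimit_nofile 8192;
events {
    worker_connections 4096;
    use epoll;
}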
Updated by openqa_review 7 months ago
- Due date set to 2024-05-29
Setting due date based on mean cycle time of SUSE QE Tools