action #160239

closed

openQA Project - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project - coordination #108209: [epic] Reduce load on OSD

[alert] External http responses Salt (https://openqa.suse.de/health) due to "Too many open files" after switch to nginx

Added by tinita about 2 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-05-12
Due date:
% Done:

0%

Estimated time:

Description

Observation

1 firing alert instance
[IMAGE]

📁 SALT › EXTERNAL HTTP RESPONSES

🔥 1 firing instances

Firing [stats.openqa-monitor.qa.suse.de]
http://stats.openqa-monitor.qa.suse.de/alerting/grafana/b3a53df8-b7ee-48dd-9325-8a541187737f/view?orgId=1
External http responses
View alert [stats.openqa-monitor.qa.suse.de]
Summary
HTTP endpoint does not properly work
Description
An HTTP endpoint we need for proper operation delivers an http status code which indicates an issue with the service or its reachability.
Values
B=500  C=1 
Labels
alertname
External http responses
grafana_folder
Salt
server
https://openqa.suse.de/health

Looking into the access log, we had 4825 HTTP 500 server errors today so far, not only for https://openqa.suse.de/health.

The error log shows many entries like:

2024/05/12 00:06:06 [crit] 2563#2563: accept4() failed (24: Too many open files)

The first occurrence I can find was 2024/05/07 12:02:50.
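For reference, one way to locate that first occurrence across the rotated error logs (a sketch; log location and rotation format are assumptions):

zgrep -h 'Too many open files' /var/log/nginx/error.log* | sort | head -n 1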

For comparison, the number of open files:

# o3
lsof | wc -l
18978
# osd
lsof | wc -l
35675
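Note that lsof | wc -l counts open-file records across all processes (and threads), so it only gives a rough system-wide picture. A sketch of a more targeted check, counting the descriptors actually held by each nginx process:

for i in $(pgrep -f nginx); do echo "$i: $(sudo ls /proc/$i/fd | wc -l)"; done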

Rollback actions


Related issues: 1 (0 open, 1 closed)

Related to openQA Project - action #130636: high response times on osd - Try nginx on OSD size:S (Resolved, mkittler, 2024-05-17)

Actions
Actions #1

Updated by tinita about 2 months ago

  • Description updated (diff)
Actions #2

Updated by tinita about 2 months ago

I created a 24h silence for the alert

Actions #3

Updated by tinita about 2 months ago

  • Priority changed from High to Urgent
Actions #4

Updated by okurz about 2 months ago

I suggest we look into solutions from
https://serverfault.com/a/1003630
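A sketch of the limits that typically matter in this situation (the linked answer may recommend something more specific):

ulimit -Hn                                  # hard per-process limit in the current shell
systemctl show -p LimitNOFILE nginx         # limit systemd grants the nginx service
cat /proc/sys/fs/file-max                   # system-wide limit on open file handles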

Actions #5

Updated by okurz about 2 months ago

  • Related to action #130636: high response times on osd - Try nginx on OSD size:S added
Actions #6

Updated by okurz about 2 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz
Actions #7

Updated by livdywan about 2 months ago

  • Subject changed from [alert] External http responses Salt (https://openqa.suse.de/health) to [alert] External http responses Salt (https://openqa.suse.de/health) / Too many open files
Actions #8

Updated by okurz about 2 months ago · Edited

  • Subject changed from [alert] External http responses Salt (https://openqa.suse.de/health) / Too many open files to [alert] External http responses Salt (https://openqa.suse.de/health) due to "Too many open files" after switch to nginx

On OSD

systemctl show nginx | grep LimitNOFILE

shows LimitNOFILE=524288

I called

for i in $(pgrep -f nginx); do grep 'open files' /proc/$i/limits; done

For now in OSD in /etc/nginx/nginx.conf I added

worker_rlimit_nofile 4096;

from https://stackoverflow.com/a/58869000/22022448
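A quick way to verify that the new limit actually reached the worker processes after a reload (a sketch; assumes the workers get respawned by the reload):

sudo nginx -t && sudo systemctl reload nginx
for i in $(pgrep -f 'nginx: worker'); do grep 'open files' /proc/$i/limits; done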

And called

host=openqa.suse.de failed_since=2024-05-09 comment=label:160239 ./openqa-advanced-retrigger-jobs

zgrep 'Too many open files' error.log* reveals that there are mentions of the error message going back to 2024-05-08.

So I actually extended the above command to

host=openqa.suse.de failed_since=2024-05-07 comment=label:160239 ./openqa-advanced-retrigger-jobs
Actions #9

Updated by okurz about 2 months ago

  • Tags set to infra, salt, alert, nginx
  • Status changed from In Progress to New
  • Priority changed from Urgent to Normal

Monitoring

sudo tail -n 200 -f /var/log/nginx/error.log | grep 'Too many open files'

shows no more problems, so I am reducing the priority. I plan to research what a good value would be and make that persistent in salt.

Actions #10

Updated by okurz about 2 months ago

Starting

2024/05/13 16:28:37 [crit] 22755#22755: accept4() failed (24: Too many open files)

again many more "Too many open files" occurrences. But apparently the problem is that my setting disappeared.

Created
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1187

and

host=openqa.suse.de failed_since="2024-05-13 16:00" comment=label:160239 ./openqa-advanced-retrigger-jobs
export host=openqa.suse.de; export failed_since="'2024-05-13 16:00'"; for i in $(./openqa-monitor-investigation-candidates | sed 's@^[^0-9]*\([0-9]*\) .*$@\1@g'); do openqa-cli api --osd -X post jobs/$i/duplicate ; done
Actions #11

Updated by okurz about 2 months ago · Edited

  • Description updated (diff)
  • Status changed from New to In Progress

The config was invalid. Created and merged https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1188 and added a silence and rollback action. Regarding the selection of the variable value I am following https://www.cloudbees.com/blog/tuning-nginx and running a benchmark with ab -c 40 -n 50000 https://openqa.suse.de/ from my notebook over VPN. Baseline with

worker_processes  4;
worker_rlimit_nofile 4096;

events {
    worker_connections  1024;
    use epoll;
}

results

Concurrency Level:      40
Time taken for tests:   593.604 seconds
Complete requests:      50000
Failed requests:        0
Total transferred:      3541550000 bytes
HTML transferred:       3452300000 bytes
Requests per second:    84.23 [#/sec] (mean)
Time per request:       474.883 [ms] (mean)
Time per request:       11.872 [ms] (mean, across all concurrent requests)
Transfer rate:          5826.35 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       73  153  70.7    149    1346
Processing:   176  321  70.7    312    1231
Waiting:      135  235  70.6    211     926
Total:        272  474  89.1    458    1821

Percentage of the requests served within a certain time (ms)
  50%    458
  66%    478
  75%    495
  80%    509
  90%    551
  95%    593
  98%    669
  99%    772
 100%   1821 (longest request)

I guess the most important here is

Requests per second:    84.23 [#/sec] (mean)

with worker_processes 8 I immediately get error messages from ab like

SSL read failed (1) - closing connection
140336139147136:error:140E0197:SSL routines:SSL_shutdown:shutdown while in init:ssl/ssl_lib.c:2094:

so I guess 8 worker processes is too much, and ab can help to check the settings nicely. With worker_processes 2; I also get "SSL handshake failed (5)." and the like.
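The manual tuning loop described above could be scripted roughly like this (a sketch; the reload command and in-place edit of /etc/nginx/nginx.conf are assumptions, and each run takes about ten minutes):

# try a few worker_processes values and compare throughput under the same load
for procs in 2 4 8; do
    sudo sed -i "s/^worker_processes .*/worker_processes  $procs;/" /etc/nginx/nginx.conf
    sudo nginx -t && sudo systemctl reload nginx
    echo "worker_processes $procs:"
    ab -c 40 -n 50000 https://openqa.suse.de/ | grep 'Requests per second'
done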

But now, with worker_connections 1024;, we are running into a lot of errors saying "1024 worker_connections are not enough". No idea why a higher value should suddenly be necessary; I could not find any reference to that error string in the log archives of the past days.

I changed that config now to

worker_processes  4;
worker_rlimit_nofile 40960; # needs to be sufficiently higher than worker_connections but below "ulimit -Hn", which is likely 524288 on current systems

events {
    worker_connections 4096;
    use epoll;
}

No more error messages with that. The benchmark reveals 83.79 [#/sec] (mean), the same as before. Stopped salt-minion on OSD to prevent the config from being overwritten again.
Created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1189. After deployment I can enable salt-minion again and make /etc/nginx/nginx.conf writable again.

Deployed, salt-minion enabled again.

worker_processes 4, worker_rlimit_nofile 8192, worker_connections 4096 yields 80.36 #/sec and no error.

Trying worker_processes 4, worker_rlimit_nofile 16384, worker_connections 8192 yields 83.18 #/sec and no error, so no significant change at all. We can stay with 8192/4096. Putting that in salt.

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1191
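For context, the rough sizing math behind staying with 8192/4096 (a back-of-the-envelope sketch, not something measured on OSD):

# a proxied client connection needs roughly two descriptors per worker
# (client socket + upstream socket), so worker_rlimit_nofile >= 2 * worker_connections:
echo $((2 * 4096))   # 8192, the chosen worker_rlimit_nofile
# theoretical ceiling on concurrent connections across all workers:
echo $((4 * 4096))   # worker_processes * worker_connections = 16384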

Actions #12

Updated by okurz about 2 months ago

  • Description updated (diff)
Actions #13

Updated by okurz about 2 months ago

  • Parent task set to #108209
Actions #14

Updated by openqa_review about 2 months ago

  • Due date set to 2024-05-29

Setting due date based on mean cycle time of SUSE QE Tools

Actions #15

Updated by okurz about 2 months ago

  • Status changed from In Progress to Feedback

No more errors have appeared since yesterday. I will monitor over the next few days.

Actions #16

Updated by okurz about 1 month ago

  • Due date deleted (2024-05-29)
  • Status changed from Feedback to Resolved

No more errors in the log even today.
