action #160239

closed

openQA Project - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project - coordination #108209: [epic] Reduce load on OSD

[alert] External http responses Salt (https://openqa.suse.de/health) due to "Too many open files" after switch to nginx

Added by tinita about 2 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-05-12
Due date:
% Done:

0%

Estimated time:

Description

Observation

1 firing alert instance
[IMAGE]

📁 SALT › EXTERNAL HTTP RESPONSES

🔥 1 firing instances

Firing [stats.openqa-monitor.qa.suse.de]
http://stats.openqa-monitor.qa.suse.de/alerting/grafana/b3a53df8-b7ee-48dd-9325-8a541187737f/view?orgId=1
External http responses
View alert [stats.openqa-monitor.qa.suse.de]
Summary
HTTP endpoint does not properly work
Description
An HTTP endpoint we need for proper operation delivers an http status code which indicates an issue with the service or its reachability.
Values
B=500  C=1 
Labels
alertname
External http responses
grafana_folder
Salt
server
https://openqa.suse.de/health

Looking into the access log, we had 4825 HTTP 500 server errors today so far, not only for https://openqa.suse.de/health.

The error log shows many entries like:

2024/05/12 00:06:06 [crit] 2563#2563: accept4() failed (24: Too many open files)

The first occurrence I can find was 2024/05/07 12:02:50.
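For reference, one way to locate that first occurrence across the rotated error logs (a sketch; log location and rotation format are assumptions):

zgrep -h 'Too many open files' /var/log/nginx/error.log* | sort | head -n 1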

For comparison, the number of open files:

# o3
lsof | wc -l
18978
# osd
lsof | wc -l
35675
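Note that lsof | wc -l counts open-file records across all processes (and threads), so it only gives a rough system-wide picture. A sketch of a more targeted check, counting the descriptors actually held by each nginx process:

for i in $(pgrep -f nginx); do echo "$i: $(sudo ls /proc/$i/fd | wc -l)"; done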

Rollback actions


Related issues: 1 (0 open, 1 closed)

Related to openQA Project - action #130636: high response times on osd - Try nginx on OSD size:S (Resolved, mkittler, 2024-05-17)

Actions
Actions #1

Updated by tinita about 2 months ago

  • Description updated (diff)
Actions #2

Updated by tinita about 2 months ago

I created a 24h silence for the alert

Actions #3

Updated by tinita about 2 months ago

  • Priority changed from High to Urgent
Actions #4

Updated by okurz about 2 months ago

I suggest we look into solutions from
https://serverfault.com/a/1003630
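A sketch of the limits that typically matter in this situation (the linked answer may recommend something more specific):

ulimit -Hn                                  # hard per-process limit in the current shell
systemctl show -p LimitNOFILE nginx         # limit systemd grants the nginx service
cat /proc/sys/fs/file-max                   # system-wide limit on open file handles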

Actions #5

Updated by okurz about 2 months ago

  • Related to action #130636: high response times on osd - Try nginx on OSD size:S added
Actions #6

Updated by okurz about 2 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz
Actions #7

Updated by livdywan about 2 months ago

  • Subject changed from [alert] External http responses Salt (https://openqa.suse.de/health) to [alert] External http responses Salt (https://openqa.suse.de/health) / Too many open files
Actions #8

Updated by okurz about 2 months ago · Edited

  • Subject changed from [alert] External http responses Salt (https://openqa.suse.de/health) / Too many open files to [alert] External http responses Salt (https://openqa.suse.de/health) due to "Too many open files" after switch to nginx

On OSD

systemctl show nginx | grep LimitNOFILE

shows LimitNOFILE=524288

I called

for i in $(pgrep -f nginx); do grep 'open files' /proc/$i/limits; done

For now in OSD in /etc/nginx/nginx.conf I added

worker_rlimit_nofile 4096;

from https://stackoverflow.com/a/58869000/22022448
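A quick way to verify that the new limit actually reached the worker processes after a reload (a sketch; assumes the workers get respawned by the reload):

sudo nginx -t && sudo systemctl reload nginx
for i in $(pgrep -f 'nginx: worker'); do grep 'open files' /proc/$i/limits; done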

And called

host=openqa.suse.de failed_since=2024-05-09 comment=label:160239 ./openqa-advanced-retrigger-jobs

zgrep 'Too many open files' error.log* reveals that there are mentions of the error message going back to 2024-05-08.

So I actually extended the above command to

host=openqa.suse.de failed_since=2024-05-07 comment=label:160239 ./openqa-advanced-retrigger-jobs
Actions #9

Updated by okurz about 2 months ago

  • Tags set to infra, salt, alert, nginx
  • Status changed from In Progress to New
  • Priority changed from Urgent to Normal

Monitoring

sudo tail -n 200 -f /var/log/nginx/error.log | grep 'Too many open files'

shows no more problems, so I am reducing the priority. I plan to research what a good value would be and make that persistent in salt.

Actions #10

Updated by okurz about 2 months ago

Starting

2024/05/13 16:28:37 [crit] 22755#22755: accept4() failed (24: Too many open files)

again many more "Too many open files" occurrences. But apparently the problem is that my setting disappeared.

Created
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1187

and

host=openqa.suse.de failed_since="2024-05-13 16:00" comment=label:160239 ./openqa-advanced-retrigger-jobs
export host=openqa.suse.de; export failed_since="'2024-05-13 16:00'"; for i in $(./openqa-monitor-investigation-candidates | sed 's@^[^0-9]*\([0-9]*\) .*$@\1@g'); do openqa-cli api --osd -X post jobs/$i/duplicate ; done
Actions #11

Updated by okurz about 2 months ago · Edited

  • Description updated (diff)
  • Status changed from New to In Progress

The config was invalid. Created and merged https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1188 and added a silence and rollback action. Regarding the selection of the variable value I am following https://www.cloudbees.com/blog/tuning-nginx and running a benchmark with ab -c 40 -n 50000 https://openqa.suse.de/ from my notebook over VPN. Baseline with

worker_processes  4;
worker_rlimit_nofile 4096;

events {
    worker_connections  1024;
    use epoll;
}

results

Concurrency Level:      40
Time taken for tests:   593.604 seconds
Complete requests:      50000
Failed requests:        0
Total transferred:      3541550000 bytes
HTML transferred:       3452300000 bytes
Requests per second:    84.23 [#/sec] (mean)
Time per request:       474.883 [ms] (mean)
Time per request:       11.872 [ms] (mean, across all concurrent requests)
Transfer rate:          5826.35 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       73  153  70.7    149    1346
Processing:   176  321  70.7    312    1231
Waiting:      135  235  70.6    211     926
Total:        272  474  89.1    458    1821

Percentage of the requests served within a certain time (ms)
  50%    458
  66%    478
  75%    495
  80%    509
  90%    551
  95%    593
  98%    669
  99%    772
 100%   1821 (longest request)

I guess the most important here is

Requests per second:    84.23 [#/sec] (mean)

with worker_processes 8 I immediately get error messages from ab like

SSL read failed (1) - closing connection
140336139147136:error:140E0197:SSL routines:SSL_shutdown:shutdown while in init:ssl/ssl_lib.c:2094:

so I guess 8 worker processes is too much, and ab can help to check the settings nicely. With worker_processes 2; I also get "SSL handshake failed (5)." and the like.
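The manual tuning loop described above could be scripted roughly like this (a sketch; the reload command and in-place edit of /etc/nginx/nginx.conf are assumptions, and each run takes about ten minutes):

# try a few worker_processes values and compare throughput under the same load
for procs in 2 4 8; do
    sudo sed -i "s/^worker_processes .*/worker_processes  $procs;/" /etc/nginx/nginx.conf
    sudo nginx -t && sudo systemctl reload nginx
    echo "worker_processes $procs:"
    ab -c 40 -n 50000 https://openqa.suse.de/ | grep 'Requests per second'
done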

But now, with worker_connections 1024;, we are running into a lot of errors saying "1024 worker_connections are not enough". No idea why a higher value should suddenly be necessary; I could not find any reference to that error string in the log archives of the past days.

I changed that config now to

worker_processes  4;
worker_rlimit_nofile 40960; # needs to be sufficiently higher than worker_connections but below "ulimit -Hn", which is likely 524288 on current systems

events {
    worker_connections 4096;
    use epoll;
}

No more error messages with that. The benchmark reveals 83.79 [#/sec] (mean), the same as before. Stopped salt-minion on OSD to prevent the config from being overwritten again.
Created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1189. After deployment I can enable salt-minion again and make /etc/nginx/nginx.conf writable again.

Deployed, salt-minion enabled again.

worker_processes 4, worker_rlimit_nofile 8192, worker_connections 4096 yields 80.36 #/sec and no error.

Trying worker_processes 4, worker_rlimit_nofile 16384, worker_connections 8192 yields 83.18 #/sec and no error, so no significant change at all. We can stay with 8192/4096. Putting that in salt.

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1191
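For context, the rough sizing math behind staying with 8192/4096 (a back-of-the-envelope sketch, not something measured on OSD):

# a proxied client connection needs roughly two descriptors per worker
# (client socket + upstream socket), so worker_rlimit_nofile >= 2 * worker_connections:
echo $((2 * 4096))   # 8192, the chosen worker_rlimit_nofile
# theoretical ceiling on concurrent connections across all workers:
echo $((4 * 4096))   # worker_processes * worker_connections = 16384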

Actions #12

Updated by okurz about 2 months ago

  • Description updated (diff)
Actions #13

Updated by okurz about 2 months ago

  • Parent task set to #108209
Actions #14

Updated by openqa_review about 2 months ago

  • Due date set to 2024-05-29

Setting due date based on mean cycle time of SUSE QE Tools

Actions #15

Updated by okurz about 2 months ago

  • Status changed from In Progress to Feedback

No more errors have appeared since yesterday. I will monitor over the next few days.

Actions #16

Updated by okurz about 1 month ago

  • Due date deleted (2024-05-29)
  • Status changed from Feedback to Resolved

No more errors in the log even today.
