action #160239
closed
openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
openQA Project (public) - coordination #108209: [epic] Reduce load on OSD
[alert] External http responses Salt (https://openqa.suse.de/health) due to "Too many open files" after switch to nginx
0%
Description
Observation
1 firing alert instance
📁 SALT › EXTERNAL HTTP RESPONSES
🔥 1 firing instances
Firing [stats.openqa-monitor.qa.suse.de]
http://stats.openqa-monitor.qa.suse.de/alerting/grafana/b3a53df8-b7ee-48dd-9325-8a541187737f/view?orgId=1
External http responses
View alert [stats.openqa-monitor.qa.suse.de]
Summary
HTTP endpoint does not properly work
Description
An HTTP endpoint we need for proper operation delivers an http status code which indicates an issue with the service or its reachability.
Values
B=500 C=1
Labels
alertname=External http responses
grafana_folder=Salt
server=https://openqa.suse.de/health
Looking into the access log, we had 4825 HTTP 500 server errors so far today, and not only for https://openqa.suse.de/health
The error log shows many entries like:
2024/05/12 00:06:06 [crit] 2563#2563: accept4() failed (24: Too many open files)
The first occurrence I can find was 2024/05/07 12:02:50.
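The two log checks above can be reproduced roughly as follows. This is a minimal sketch, not the commands actually used for this ticket: the function names `count_500` and `first_emfile` are made up for illustration, the log paths in the comments are assumptions, and the field position assumes nginx's default "combined" access log format with a plain three-token request line.

```shell
# Count requests answered with HTTP 500 in an nginx access log.
# Assumption: default "combined" log format, where the status code is
# the 9th whitespace-separated field.
count_500() {
    awk '$9 == 500 { n++ } END { print n + 0 }' "$1"
}

# Print the first line in an error log that mentions the descriptor limit.
first_emfile() {
    grep -m1 -F 'Too many open files' "$1"
}

# Hypothetical paths; adjust to the actual nginx configuration.
# count_500 /var/log/nginx/access.log
# first_emfile /var/log/nginx/error.log
```

If the error log is rotated, the earliest occurrence may sit in an older, possibly compressed, file, so the result of `first_emfile` on the current file is only a lower bound on how long the problem has existed.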
For comparison, the number of open files on o3 and OSD, counted system-wide with lsof:
# o3
lsof | wc -l
18978
# osd
lsof | wc -l
35675
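Note that `lsof | wc -l` counts system-wide lsof entries (including non-descriptor entries such as memory-mapped files), while error 24 (EMFILE) in the nginx log refers to the per-process descriptor limit. A per-process view may therefore be more telling. The following is a sketch for Linux only; the helper name `fd_usage` is hypothetical and it assumes `/proc` is mounted and readable for the target process.

```shell
# Print "<open fds> <soft limit>" for one process, read from /proc (Linux only).
fd_usage() {
    pid=$1
    # Each entry in /proc/<pid>/fd is one open file descriptor.
    open=$(ls "/proc/$pid/fd" | wc -l)
    # In /proc/<pid>/limits the 4th field of "Max open files" is the soft limit.
    limit=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits")
    echo "$open $limit"
}

# Check every nginx process; usually needs root to read other users' /proc entries.
# for pid in $(pgrep nginx); do echo "$pid: $(fd_usage "$pid")"; done
```

A process whose first number is close to the second is the likely source of the `accept4() failed` messages.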
Rollback actions
- DONE Remove silence from https://stats.openqa-monitor.qa.suse.de/alerting/silences
alertname=External http responses server=https://openqa.suse.de/health