action #160239
Updated by okurz 7 months ago
## Observation
1 firing alert instance
Firing: External http responses (Grafana folder Salt) on stats.openqa-monitor.qa.suse.de
http://stats.openqa-monitor.qa.suse.de/alerting/grafana/b3a53df8-b7ee-48dd-9325-8a541187737f/view?orgId=1

* Summary: HTTP endpoint does not properly work
* Description: An HTTP endpoint we need for proper operation delivers an http status code which indicates an issue with the service or its reachability.
* Values: B=500 C=1
* Labels:
  * alertname: External http responses
  * grafana_folder: Salt
  * server: https://openqa.suse.de/health

Looking into the access log, we had 4825 HTTP 500 (Internal Server Error) responses today so far, not only for https://openqa.suse.de/health.
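A count like this can be reproduced with a short awk sketch. It assumes the access log uses the common "combined" log format; the helper names `count_500_total` and `count_500_by_path` are hypothetical, as is any log path passed to them.

```shell
# Sketch: count HTTP 500 responses in an access log, in total and per
# request path. Assumes the "combined" log format, where (splitting on
# whitespace) field 4 starts with "[DD/Mon/YYYY:", field 7 is the request
# path and field 9 is the status code.
# The optional second argument restricts counting to one day, e.g. 12/May/2024.
count_500_total() {
    awk -v day="${2:-}" '
        (day == "" || index($4, "[" day ":") == 1) && $9 == 500 { n++ }
        END { print n + 0 }' "$1"
}

# Same filter, but grouped by request path, most frequent first.
count_500_by_path() {
    awk -v day="${2:-}" '
        (day == "" || index($4, "[" day ":") == 1) && $9 == 500 { n[$7]++ }
        END { for (p in n) print n[p], p }' "$1" | sort -rn
}
```

Running `count_500_by_path <access log> 12/May/2024 | head` would show whether the errors are spread across many routes rather than limited to /health.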
The error log shows many entries like:
```
2024/05/12 00:06:06 [crit] 2563#2563: accept4() failed (24: Too many open files)
```
The first occurrence I can find was 2024/05/07 12:02:50.
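Finding the earliest occurrence across the current and rotated error logs can be sketched as below. It relies only on the `YYYY/MM/DD HH:MM:SS` prefix seen in the log line above, which sorts correctly as plain text; the function name `first_emfile` is hypothetical, and compressed rotated logs would need `zgrep` or `xzgrep` in place of `grep`.

```shell
# Sketch: print the timestamp of the earliest "Too many open files" entry
# across all error log files given as arguments (plain-text logs only).
# The "YYYY/MM/DD HH:MM:SS" prefix is 19 characters and sorts
# chronologically under byte-wise (C locale) ordering.
first_emfile() {
    grep -h 'Too many open files' "$@" | LC_ALL=C sort | head -n 1 | cut -c 1-19
}
```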
For comparison, the number of open files on o3 and OSD:
```
# o3
lsof | wc -l
18978
# osd
lsof | wc -l
35675
```
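Error 24 from `accept4()` is `EMFILE`, which means the calling process reached its own open-file limit (`RLIMIT_NOFILE`), not a system-wide one. A system-wide `lsof | wc -l` therefore does not show how close the affected process is to its limit, and it also counts entries that are not file descriptors, such as memory-mapped libraries and working directories. The following Linux-only sketch reads `/proc` to compare each given process's descriptor count with its soft "Max open files" limit; `fd_usage` is a hypothetical helper name.

```shell
# Sketch (Linux only): for each PID given, print "PID NAME OPEN_FDS/SOFT_LIMIT".
# EMFILE (errno 24) is raised when a process hits its own soft limit on
# open file descriptors, so this per-process view complements lsof | wc -l.
fd_usage() {
    local pid count limit
    for pid in "$@"; do
        # Skip processes that have exited or whose fd table we cannot read.
        [ -r "/proc/$pid/fd" ] || continue
        # Each entry in /proc/PID/fd is one open file descriptor.
        count=$(find "/proc/$pid/fd" -mindepth 1 -maxdepth 1 2>/dev/null | wc -l)
        # In /proc/PID/limits, field 4 of the "Max open files" line is the soft limit.
        limit=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits")
        printf '%s %s %s/%s\n' "$pid" "$(cat "/proc/$pid/comm")" "$count" "$limit"
    done
}
```

For example, running `fd_usage $(pgrep -x nginx)` as root would print one line per process, assuming the web server processes are named `nginx` (the `[crit] PID#TID:` error log format above is nginx's); a worker whose count is close to its limit would explain the `accept4()` failures.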
## Rollback actions
* Remove silence from https://stats.openqa-monitor.qa.suse.de/alerting/silences `alertname=External http responses server=https://openqa.suse.de/health`