action #160239

Updated by okurz 2 months ago

## Observation 
 1 firing alert instance 

 📁 SALT › EXTERNAL HTTP RESPONSES 

   🔥 1 firing instances 

 Firing [stats.openqa-monitor.qa.suse.de] 
 http://stats.openqa-monitor.qa.suse.de/alerting/grafana/b3a53df8-b7ee-48dd-9325-8a541187737f/view?orgId=1 
 External http responses 
 View alert [stats.openqa-monitor.qa.suse.de] 
 Summary 
 HTTP endpoint does not properly work 
 Description 
 An HTTP endpoint we need for proper operation delivers an http status code which indicates an issue with the service or its reachability. 
 Values 
 B=500  C=1  
 Labels 
 alertname 
 External http responses 
 grafana_folder 
 Salt 
 server 
 https://openqa.suse.de/health 

 Looking into the access log, we had 4825 HTTP 500 server errors so far today, and not only for https://openqa.suse.de/health 
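
A count like the one above can be reproduced with a small helper. This is only a sketch: it assumes the access log uses the common "combined" format, where the status code is the ninth whitespace-separated field, and the function name `count_status` and the log path in the usage comment are hypothetical.

```shell
#!/bin/sh
# Count responses per HTTP status code from an access log read on stdin.
# Assumption: "combined" log format, so the status code is field 9.
count_status() {
    awk '{ codes[$9]++ } END { for (c in codes) print c, codes[c] }' | sort
}

# Hypothetical usage; the log path is an assumption, adjust to the host:
# count_status < /var/log/nginx/access.log
```

Filtering the input by date first (for example with `grep`) would restrict the count to a single day.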

 The error log shows many entries like: 
 ``` 
 2024/05/12 00:06:06 [crit] 2563#2563: accept4() failed (24: Too many open files) 
 ``` 
 The first occurrence I can find was 2024/05/07 12:02:50. 

 For comparison, the number of open files: 
 ``` 
 # o3 
 lsof | wc -l 
 18978 
 # osd 
 lsof | wc -l 
 35675 
 ``` 
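
`lsof | wc -l` counts open files across the whole system, while error 24 (`EMFILE`) is raised when a single process hits its own descriptor limit. A per-process view may therefore be more telling. The following is a minimal sketch, assuming Linux with `/proc` mounted and assuming the web server processes are named `nginx` (the log line format suggests this, but the ticket does not state it); `fd_usage` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the number of open file descriptors and the soft limit for one PID.
# Assumption: Linux with /proc; run as root to inspect other users' processes.
fd_usage() {
    pid="$1"
    # Each entry in /proc/<pid>/fd is one open descriptor
    count=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l | tr -d ' ')
    # "Max open files" line: field 4 is the soft limit, field 5 the hard limit
    limit=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
    echo "pid=$pid open_fds=$count soft_limit=$limit"
}

# Hypothetical usage; the process name "nginx" is an assumption:
for pid in $(pgrep nginx 2>/dev/null); do
    fd_usage "$pid"
done
```

If `open_fds` sits close to `soft_limit` for a worker process, the limit for that service is the likely bottleneck rather than the system-wide total.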

 ## Rollback actions 
 * *DONE* Remove silence from https://stats.openqa-monitor.qa.suse.de/alerting/silences `alertname=External http responses server=https://openqa.suse.de/health`
