action #107875

Updated by livdywan about 2 years ago

## Observation 

 We've got the alert [again](https://progress.opensuse.org/issues/107257) on March 3, 2022 09:00:40: 

 ``` 
 [Alerting] Apache Response Time alert 
 The apache response time exceeded the alert threshold.
 * Check the load of the web UI host
 * Consider restarting the openQA web UI service and/or apache
 Also see https://progress.opensuse.org/issues/73633

 Metric name    Value
 Min            18733128.83
 ``` 

 Relevant panel: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=84 

 --- 

 Tina wrote in chat 

 > if anyone was wondering about the short high load on osd, I fetched /api/v1/jobs and it took 10 minutes 

 but that was already on Wednesday, so it shouldn't have caused this. 

    
 Further data points:
 - High CPU likely didn't affect scheduling, or we should've had other reports of it 
 - High CPU wouldn't cause a spike in failures in jobs? 
    
 ## Suggestions 
 * The apache log parsing seems to be quite heavy. Can we reduce the amount of data parsed by telegraf?
 * Reduce the interval at which we take new data points in telegraf
 * Extend the alerting measurement period from 5m to 30m (or higher) to smooth out gaps
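
 To illustrate the last suggestion, here is a rough sketch of how a longer measurement period dampens a single outlier. Everything in it is an assumption made for illustration: the sample values, the one-sample-per-minute rate, the `THRESHOLD` value and the use of a mean. The real alert may use a different reducer (the alert output above shows `Min`) and a different threshold.

```python
from collections import deque


def rolling_mean(samples, window):
    """Return the mean of the last `window` samples at each step.

    The first few results use fewer samples until the window has filled up.
    """
    buf = deque(maxlen=window)
    means = []
    for value in samples:
        buf.append(value)
        means.append(sum(buf) / len(buf))
    return means


# Hypothetical data: one sample per minute for an hour, with one outlier.
BASELINE = 200_000      # assumed "normal" response time value
SPIKE = 18_000_000      # assumed outlier, roughly the size seen in the alert
THRESHOLD = 1_000_000   # assumed alert threshold, not the real one

samples = [BASELINE] * 60
samples[40] = SPIKE

peak_5m = max(rolling_mean(samples, 5))
peak_30m = max(rolling_mean(samples, 30))

print("5m window would alert: ", peak_5m > THRESHOLD)
print("30m window would alert:", peak_30m > THRESHOLD)
```

 With these made-up numbers the 5m window crosses the threshold while the 30m window stays below it. The trade-off is that a longer period also delays alerts for sustained problems.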
