action #95443

Updated by livdywan almost 3 years ago

## Observation 

 I observed several unhandled alerts on Grafana on Sunday and Monday. 

 ``` 
 [Alerting] Job age (scheduled) (max) alert 

 Jobs not scheduled for 4 days (345600s). Possible reasons: 
 * There are no online workers for selected scheduled jobs, misconfiguration on the side of tests likely 

 See https://progress.opensuse.org/issues/73174#note-2 for an explanation of the selection of the specific value 

 Metric name: 50% percentile (max) 
 Value: 501773.500 
 ``` 
 [click](http://stats.openqa-monitor.qa.suse.de/d/7W06NBWGk/job-age?tab=alert&viewPanel=2&orgId=1) 

 ``` 
 [Alerting] Job age (scheduled) (median) alert 

 Check for overall decrease of "time to start". Possible reasons for regression: 
 * Not enough ressources 
 * Too many tests scheduled due to misconfiguration 

 2020-11-27: Alert limit set to 259200s = 3d, see https://progress.opensuse.org/issues/73174#note-2 about the decision 
 Related progress issue: https://progress.opensuse.org/issues/65975 

 Metric name: 50% percentile (median) 
 Value: 501113.500 
 ``` 
 [click](http://stats.openqa-monitor.qa.suse.de/d/7W06NBWGk/job-age?tab=alert&viewPanel=5&orgId=1) 

 ``` 
 [Alerting] Job age (scheduled) (max) alert 

 Jobs not scheduled for 4 days (345600s). Possible reasons: 
 * There are no online workers for selected scheduled jobs, misconfiguration on the side of tests likely 

 See https://progress.opensuse.org/issues/73174#note-2 for an explanation of the selection of the specific value 

 Metric name: 50% percentile (max) 
 Value: 954811.000 
 ``` 
 [click](http://stats.openqa-monitor.qa.suse.de/d/7W06NBWGk/job-age?tab=alert&viewPanel=2&orgId=1) 

 [No Data] Incomplete jobs (not restarted) of last 24h alert [click](http://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?tab=alert&viewPanel=17&orgId=1) 
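
 For scale, the alert values above are in seconds. A minimal sketch that converts them to days and compares them with the limits quoted in the alert descriptions (345600s for max, 259200s for median) follows. The labels and helper names are only illustrative and not part of any openQA or Grafana tooling.

```python
# Limits quoted in the alert descriptions (seconds).
MAX_LIMIT_S = 345600      # "max" alert: 4 days
MEDIAN_LIMIT_S = 259200   # "median" alert: 3 days

SECONDS_PER_DAY = 86400


def to_days(seconds):
    """Convert a duration in seconds to days."""
    return seconds / SECONDS_PER_DAY


# (illustrative label, observed value in seconds, limit in seconds);
# the observed values are copied from the alerts in this ticket.
observations = [
    ("max, first alert", 501773.5, MAX_LIMIT_S),
    ("median", 501113.5, MEDIAN_LIMIT_S),
    ("max, second alert", 954811.0, MAX_LIMIT_S),
]


def summarize(observations):
    """Return (label, value in days, limit in days, limit exceeded?) per alert."""
    return [
        (label, round(to_days(value), 2), round(to_days(limit), 2), value > limit)
        for label, value, limit in observations
    ]


for label, days, limit_days, exceeded in summarize(observations):
    print(f"{label}: {days} d (limit {limit_days} d, exceeded: {exceeded})")
```

 So the oldest scheduled job was roughly 5.8 days old at the first max alert and about 11 days old at the second, well above the 4 day limit in both cases.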

 ## Acceptance criteria 
 - **AC1**: The cause of the alerts is known, or a follow-up ticket with a feature request is filed so that the necessary details are available next time 

 ## Suggestions 
 - Look at the alert history in Grafana 
 - Look at all tests and check for cancelled jobs or removed workers 
 - Other issues handling these alerts recently: #93612 and #92110 according to a quick search.
