action #178204

Updated by tinita about 4 hours ago

 
 ## Observation 
 https://monitor.qa.suse.de/d/7W06NBWGk/job-age?orgId=1&from=2025-03-03T02:45:29.209Z&to=2025-03-03T06:58:26.736Z&timezone=UTC 

 Relevant panel: https://monitor.qa.suse.de/d/7W06NBWGk/job-age?viewPanel=panel-5&orgId=1&from=2025-03-01T19%3A35%3A43.674Z&to=2025-03-04T06%3A19%3A31.656Z&timezone=utc 

 Based on observations there are recurring alerts indicating long wait times before execution. 

 gpuliti preferred not to silence the alert since it is not that common yet, at least in the last week, but we should try to optimize test scheduling to reduce waiting times. 

 The main offender seems to be jobs with a worker class config that can never be picked up, as there are no workers for "qemu_x86_64,intel,tap", scheduled by "QE Security". 

 ## Acceptance Criteria 
 * **AC1:** There is an understanding to remove/change the alert or have another workflow to handle the alert 

 ## Suggestions 
 * Are there any bottlenecks? Answer: No, there aren't. We need to discuss expectations. 
 The main problem is  
 * Also see similar stories from the past #73174 
 * Report a new feature request to detect jobs that cannot be picked up by any currently available worker, because no worker provides the requested worker class combination, and block this ticket on that request. Once that exists, we can cancel such jobs earlier and still keep a sensible alert for jobs that match current workers but are delayed for a long time 
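
 The detection suggested in the last item could be sketched roughly as follows. This is only an illustration, not openQA code: the function names and the sample data are hypothetical, and it assumes that a job can only be picked up by a worker that offers every class listed in the job's comma-separated worker class value.

```python
from __future__ import annotations


def parse_classes(value: str) -> frozenset[str]:
    """Split a comma-separated worker class string into a set of class names."""
    return frozenset(part.strip() for part in value.split(",") if part.strip())


def find_unassignable_jobs(jobs: dict[int, str], workers: list[str]) -> list[int]:
    """Return IDs of jobs whose requested classes no single worker fully covers.

    `jobs` maps a (hypothetical) job ID to its worker class string,
    `workers` lists the worker class string of each registered worker.
    """
    worker_sets = [parse_classes(w) for w in workers]
    unassignable = []
    for job_id, wanted in jobs.items():
        wanted_set = parse_classes(wanted)
        # Assumption: a worker matches only if it provides all requested classes.
        if not any(wanted_set <= offered for offered in worker_sets):
            unassignable.append(job_id)
    return unassignable


if __name__ == "__main__":
    # Made-up example data, not taken from a real instance.
    scheduled = {1: "qemu_x86_64,intel,tap", 2: "qemu_x86_64,tap"}
    registered = ["qemu_x86_64,tap", "qemu_x86_64,intel"]
    print(find_unassignable_jobs(scheduled, registered))
```

 Jobs reported by such a check could be cancelled or flagged early, so that the job age alert only reflects jobs that some existing worker could in principle run.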

 ## Rollback actions 
 * Remove silence from https://monitor.qa.suse.de/alerting/silences?alertmanager=grafana `alertname=Job age (scheduled) (median) alert`
