action #95443
Updated by livdywan over 3 years ago
## Observation

I observed several unhandled alerts on Grafana on Sunday and Monday.

```
[Alerting] Job age (scheduled) (max) alert
Jobs not scheduled for 4 days (345600s). Possible reasons:
* There are no online workers for selected scheduled jobs, misconfiguration on the side of tests likely

See https://progress.opensuse.org/issues/73174#note-2 for an explanation of the selection of the specific value

Metric name             Value
50% percentile (max)    501773.500
```
[click](http://stats.openqa-monitor.qa.suse.de/d/7W06NBWGk/job-age?tab=alert&viewPanel=2&orgId=1)

```
[Alerting] Job age (scheduled) (median) alert
Check for overall decrease of "time to start". Possible reasons for regression:
* Not enough ressources
* Too many tests scheduled due to misconfiguration

2020-11-27: Alert limit set to 259200s = 3d, see https://progress.opensuse.org/issues/73174#note-2 about the decision

Related progress issue: https://progress.opensuse.org/issues/65975

Metric name               Value
50% percentile (median)   501113.500
```
[click](http://stats.openqa-monitor.qa.suse.de/d/7W06NBWGk/job-age?tab=alert&viewPanel=5&orgId=1)

```
[Alerting] Job age (scheduled) (max) alert
Jobs not scheduled for 4 days (345600s). Possible reasons:
* There are no online workers for selected scheduled jobs, misconfiguration on the side of tests likely

See https://progress.opensuse.org/issues/73174#note-2 for an explanation of the selection of the specific value

Metric name             Value
50% percentile (max)    954811.000
```
[click](http://stats.openqa-monitor.qa.suse.de/d/7W06NBWGk/job-age?tab=alert&viewPanel=2&orgId=1)

[No Data] Incomplete jobs (not restarted) of last 24h alert
[click](http://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?tab=alert&viewPanel=17&orgId=1)

## Acceptance criteria

- **AC1**: The cause of the alerts is clear or a follow-up ticket is filed with a feature request to have the necessary details next time

## Suggestions

- Look at the alert history in Grafana
- Look at all tests and check for cancelled jobs or removed workers
- Other issues handling these alerts recently: #93612 and #92110 according to a quick search.
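
To make the numbers in the alerts easier to compare, here is a minimal sketch that converts the quoted limits and metric values from seconds to days. The thresholds and values are copied from the alert texts above; the names `seconds_to_days`, `ALERTS` and `summarize` are hypothetical helpers for illustration, not part of openQA or Grafana.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def seconds_to_days(seconds: float) -> float:
    """Convert a duration in seconds to days."""
    return seconds / SECONDS_PER_DAY


# (label, alert limit in seconds, reported metric value in seconds),
# copied by hand from the alert messages quoted in this ticket.
ALERTS = [
    ("max, first alert", 345600, 501773.5),
    ("median", 259200, 501113.5),
    ("max, second alert", 345600, 954811.0),
]


def summarize(alerts):
    """Return (label, limit in days, value in days, limit exceeded?) per alert."""
    rows = []
    for label, limit_s, value_s in alerts:
        rows.append((
            label,
            seconds_to_days(limit_s),
            round(seconds_to_days(value_s), 2),
            value_s > limit_s,
        ))
    return rows


if __name__ == "__main__":
    for label, limit_d, value_d, exceeded in summarize(ALERTS):
        print(f"{label}: limit {limit_d:.0f}d, value {value_d:.2f}d, exceeded={exceeded}")
```

By this arithmetic the first two alerts report a job age of roughly 5.8 days and the last one roughly 11 days, against limits of 4 and 3 days respectively.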