action #102437
closed
Job age alert median followed by max size:S
Description
Observation
Job age (scheduled) (median) is alerting:
50% percentile (median) 392527.500
Job age (scheduled) (max) is alerting:
50% percentile (max) 402487.500
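For readability, the two values can be converted to days. This is only a small sketch and assumes the alert reports job age in seconds, which the ticket does not state explicitly; `seconds_to_days` is a hypothetical helper name.

```python
# Assumption: the alert values above are job ages in seconds.
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_to_days(seconds: float) -> float:
    """Convert a duration in seconds to days."""
    return seconds / SECONDS_PER_DAY


median_age_days = seconds_to_days(392527.5)
max_age_days = seconds_to_days(402487.5)

print(f"median: {median_age_days:.2f} days")  # median: 4.54 days
print(f"max:    {max_age_days:.2f} days")     # max:    4.66 days
```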
Suggestion
- Look at scheduled jobs from yesterday, e.g. in the audit log
- Check currently running/scheduled jobs
- Look at currently queued and cancelled jobs
Updated by livdywan about 3 years ago
- Subject changed from Job age alert mediam followed by max to Job age alert median followed by max
Updated by livdywan about 3 years ago
- Copied to action #102440: openqa-review pipeline failed with assert self.issue_type == "bugzilla" added
Updated by livdywan about 3 years ago
- Subject changed from Job age alert median followed by max to Job age alert median followed by max size:S
- Description updated
- Status changed from New to Workable
Updated by mkittler about 3 years ago
The feature for auto-cancelling scheduled jobs was deployed before the alert fired, but so far it has cancelled only a few jobs (and only one since the alert fired):
openqa=> select id, state, result, reason, t_created, t_finished from jobs where reason like '%scheduled for more than%' order by t_created desc limit 100;
id | state | result | reason | t_created | t_finished
---------+-----------+-----------+--------------------------------+---------------------+--------------------
7630506 | cancelled | obsoleted | scheduled for more than 7 days | 2021-11-08 11:33:22 | 2021-11-16 11:34:46
7605743 | cancelled | obsoleted | scheduled for more than 7 days | 2021-11-04 16:58:09 | 2021-11-12 16:58:23
7599705 | cancelled | obsoleted | scheduled for more than 7 days | 2021-11-03 17:02:51 | 2021-11-11 17:03:18
7599703 | cancelled | obsoleted | scheduled for more than 7 days | 2021-11-03 17:02:50 | 2021-11-11 17:03:18
7599698 | cancelled | obsoleted | scheduled for more than 7 days | 2021-11-03 17:02:46 | 2021-11-11 17:03:17
7599694 | cancelled | obsoleted | scheduled for more than 7 days | 2021-11-03 17:02:44 | 2021-11-11 17:03:17
7599691 | cancelled | obsoleted | scheduled for more than 7 days | 2021-11-03 17:02:41 | 2021-11-11 17:03:17
7599669 | cancelled | obsoleted | scheduled for more than 7 days | 2021-11-03 17:01:31 | 2021-11-11 17:01:47
7599667 | cancelled | obsoleted | scheduled for more than 7 days | 2021-11-03 17:01:30 | 2021-11-11 17:01:47
(9 rows)
So the auto-cancelling didn't have much impact here.
The auto-cancelling only affects jobs older than 7 days, whereas the alert already fires for jobs older than 3 days. This explains the low impact of the feature. Maybe we want to align the two thresholds?
Updated by mkittler about 3 years ago
- Status changed from Workable to Feedback
It looks like most of these jobs had a special worker class (to investigate #101030):
openqa=> select count(id), (select value from job_settings where job_id = jobs.id and key = 'WORKER_CLASS' limit 1) as worker_class from jobs where t_created < '2021-11-10' and t_finished > '2021-11-14' group by worker_class;
count | worker_class
-------+---------------------------------
109 | qemu_aarch64_unstable_poo101030
1 | s390x-kvm-sle15
(2 rows)
(Changing the condition for t_created by "+-" a day doesn't really affect the outcome of the query.)
So it does not look like a general problem. I'm not sure whether we can improve the alert to avoid firing in those cases. Maybe we can just resolve the ticket now that we have gathered these findings?
Updated by tinita about 3 years ago
The alert is not reacting to a fixed number of days, but a certain percentage, so it depends a lot on how many jobs are scheduled in total, and how many are "old".
So the auto-cancelling doesn't really match what the alert is complaining about.
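To illustrate how the mix of old and fresh jobs drives the median, here is a minimal sketch. The count of 109 comes from the query result above; the ages (5 days for the held-back jobs, 0.1 days for everything else) and the function name `median_age` are made up for illustration only.

```python
from statistics import median


def median_age(old_jobs: int, fresh_jobs: int,
               old_age: float = 5.0, fresh_age: float = 0.1) -> float:
    """Median age in days of a queue made of old and fresh jobs.

    The default ages are hypothetical values, not taken from the ticket.
    """
    ages = [old_age] * old_jobs + [fresh_age] * fresh_jobs
    return median(ages)


# Few fresh jobs: the old jobs dominate and the median is high.
print(median_age(109, 50))   # 5.0
# Many fresh jobs: the same 109 old jobs no longer move the median.
print(median_age(109, 200))  # 0.1
```

The same set of old jobs can therefore trigger the alert or not, depending only on how many other jobs are scheduled at the time.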
Also no idea how to improve the alert (I still don't fully understand the Grafana data although Oli already explained it to me :)
I would close this.
Updated by mkittler about 3 years ago
I know that the alert only checks whether the median age exceeds 3 days (or 4 days in the case of the "max" alert). So it is indeed different from the auto-cancelling feature, which looks at individual jobs. I just wanted to bring up for discussion whether we might want to adjust such alerts now that we have the auto-cancelling feature, but that's likely out of scope here.
I also don't know how to improve the alert. Making it smart enough to figure out that jobs like these are special, and that it is ok if they are processed more slowly than usual, is likely quite complicated.
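The difference between the two checks can be sketched as follows. This is not the actual Grafana or openQA logic, just an illustration using the 3-day median threshold and the 7-day per-job limit mentioned in this ticket; the job ages and the function name `evaluate` are hypothetical.

```python
from statistics import median

MEDIAN_ALERT_DAYS = 3  # alert threshold on the median age, per this ticket
AUTO_CANCEL_DAYS = 7   # per-job limit of the auto-cancelling feature


def evaluate(ages_days):
    """Return (alert_fires, ages_to_cancel) for a list of job ages in days."""
    alert_fires = median(ages_days) > MEDIAN_ALERT_DAYS
    ages_to_cancel = [age for age in ages_days if age > AUTO_CANCEL_DAYS]
    return alert_fires, ages_to_cancel


# Hypothetical queue: every job is between 3 and 7 days old.
alert_fires, ages_to_cancel = evaluate([3.5, 4.0, 4.5, 5.0, 6.0])
print(alert_fires, ages_to_cancel)  # True []
```

In this example the alert fires even though no single job is old enough to be auto-cancelled.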