action #102437

Job age alert median followed by max size:S

Added by cdywan 2 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2021-07-28
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

Job age (scheduled) (median) is alerting:

50% percentile (median) 392527.500

Job age (scheduled) (max) is alerting:

50% percentile (max) 402487.500

Suggestion

  • Look at scheduled jobs from yesterday e.g. audit log
  • Check currently running/scheduled jobs
  • Look at current queued and cancelled jobs

Related issues

Copied to openQA Project - action #102440: openqa-review pipeline failed with assert self.issue_type == "bugzilla" (Resolved, 2021-07-28, 2021-11-30)

History

#1 Updated by cdywan 2 months ago

  • Subject changed from Job age alert mediam followed by max to Job age alert median followed by max

#2 Updated by cdywan 2 months ago

  • Copied to action #102440: openqa-review pipeline failed with assert self.issue_type == "bugzilla" added

#3 Updated by okurz 2 months ago

  • Priority changed from Normal to Urgent

#4 Updated by cdywan 2 months ago

  • Subject changed from Job age alert median followed by max to Job age alert median followed by max size:S
  • Description updated (diff)
  • Status changed from New to Workable

#5 Updated by mkittler 2 months ago

  • Assignee set to mkittler

#6 Updated by mkittler 2 months ago

The feature for auto-cancelling scheduled jobs was deployed before the alert fired, but so far it has cancelled only a few jobs (only one since the alert fired):

openqa=> select id, state, result, reason, t_created, t_finished from jobs where reason like '%scheduled for more than%' order by t_created desc limit 100;
   id    |   state   |  result   |             reason             |      t_created      |     t_finished     
---------+-----------+-----------+--------------------------------+---------------------+--------------------
 7630506 | cancelled | obsoleted | scheduled for more than 7 days | 2021-11-08 11:33:22 | 2021-11-16 11:34:46
 7605743 | cancelled | obsoleted | scheduled for more than 7 days | 2021-11-04 16:58:09 | 2021-11-12 16:58:23
 7599705 | cancelled | obsoleted | scheduled for more than 7 days | 2021-11-03 17:02:51 | 2021-11-11 17:03:18
 7599703 | cancelled | obsoleted | scheduled for more than 7 days | 2021-11-03 17:02:50 | 2021-11-11 17:03:18
 7599698 | cancelled | obsoleted | scheduled for more than 7 days | 2021-11-03 17:02:46 | 2021-11-11 17:03:17
 7599694 | cancelled | obsoleted | scheduled for more than 7 days | 2021-11-03 17:02:44 | 2021-11-11 17:03:17
 7599691 | cancelled | obsoleted | scheduled for more than 7 days | 2021-11-03 17:02:41 | 2021-11-11 17:03:17
 7599669 | cancelled | obsoleted | scheduled for more than 7 days | 2021-11-03 17:01:31 | 2021-11-11 17:01:47
 7599667 | cancelled | obsoleted | scheduled for more than 7 days | 2021-11-03 17:01:30 | 2021-11-11 17:01:47
(9 rows)

So the auto-cancelling didn't have much impact here.


The auto-cancelling only affects jobs older than 7 days, whereas the alert already fires for jobs older than 3 days. This explains the low impact of the feature. Maybe we want to align the two thresholds?
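To make the gap between the two thresholds concrete, here is a minimal sketch that converts the values from the alert messages into days. It assumes the alert values are in seconds, which the ticket does not state explicitly; the constant and function names are only for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Values copied from the alert messages in the description.
# Assumption: the unit is seconds (not confirmed in the ticket).
observed = {"median": 392527.5, "max": 402487.5}

ALERT_THRESHOLD_DAYS = 3  # alert threshold mentioned in this comment
AUTO_CANCEL_DAYS = 7      # "scheduled for more than 7 days"


def to_days(seconds: float) -> float:
    """Convert an age in seconds to days."""
    return seconds / SECONDS_PER_DAY


for name, seconds in observed.items():
    days = to_days(seconds)
    print(
        f"{name}: {days:.2f} days, "
        f"above alert threshold: {days > ALERT_THRESHOLD_DAYS}, "
        f"above auto-cancel threshold: {days > AUTO_CANCEL_DAYS}"
    )
```

Under that assumption both values are roughly 4.5 days: high enough to trigger the alert, but well below the 7 days after which a job gets cancelled.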

#7 Updated by mkittler 2 months ago

  • Status changed from Workable to Feedback

It looks like most of these jobs had a special worker class (to investigate #101030):

openqa=> select count(id), (select value from job_settings where job_id = jobs.id and key = 'WORKER_CLASS' limit 1) as worker_class from jobs where t_created < '2021-11-10' and t_finished > '2021-11-14' group by worker_class;
 count |          worker_class           
-------+---------------------------------
   109 | qemu_aarch64_unstable_poo101030
     1 | s390x-kvm-sle15
(2 rows)

(Shifting the t_created condition by a day in either direction doesn't really affect the outcome of the query.)

So it does not look like a general problem. I'm not sure whether we can improve the alert to avoid firing in those cases. Maybe we can just resolve the ticket now that we have gathered these findings?

#8 Updated by tinita 2 months ago

The alert is not reacting to a fixed number of days, but a certain percentage, so it depends a lot on how many jobs are scheduled in total, and how many are "old".
So the auto-cancelling doesn't really match what the alert is complaining about.
Also no idea how to improve the alert (I still don't fully understand the Grafana data although Oli already explained it to me :)

I would close this.

#9 Updated by mkittler 2 months ago

I know that the alert only checks whether the median age exceeds 3 days (or 4 days in case of the "max" alert). So it is indeed different from the auto-cancelling feature, which looks at individual jobs. I just wanted to bring up for discussion whether we might want to adjust such alerts now that we have the auto-cancelling feature, but that's likely out of scope here.

I also don't know how to improve the alert. Making it smart enough to figure out that jobs like these are special, and that it is OK if they are processed more slowly than usual, is likely quite complicated.
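The difference between the median-based alert and the per-job auto-cancelling can be sketched roughly as follows. This is only an illustration, not the actual Grafana query or the openQA implementation: the job ages are made up, and the function names are hypothetical.

```python
from statistics import median

ALERT_MEDIAN_DAYS = 3  # median threshold mentioned above
AUTO_CANCEL_DAYS = 7   # "scheduled for more than 7 days"


def median_alert_fires(ages_days):
    """Aggregate check: looks at the median age of all scheduled jobs."""
    return median(ages_days) > ALERT_MEDIAN_DAYS


def jobs_to_cancel(ages_days):
    """Per-job check: selects only the jobs that are individually too old."""
    return [age for age in ages_days if age > AUTO_CANCEL_DAYS]


# Hypothetical queue: a few fresh jobs plus a batch stuck for 4-5 days,
# e.g. because they all need the same special worker class.
ages = [0.1, 0.2, 4.5, 4.6, 4.6, 4.7, 4.7]

print(median_alert_fires(ages))  # the median is dominated by the stuck batch
print(jobs_to_cancel(ages))      # no single job has reached 7 days yet
```

With such a queue the alert fires while nothing is cancelled, which matches what we observed here.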

#10 Updated by okurz 2 months ago

  • Status changed from Feedback to Resolved

Yes, that's fine
