action #102437

closed

Job age alert median followed by max size:S

Added by livdywan over 2 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2021-07-28
Due date:
% Done:

0%

Estimated time:

Description

Observation

Job age (scheduled) (median) is alerting:

50% percentile (median) 392527.500

Job age (scheduled) (max) is alerting:

50% percentile (max) 402487.500

Suggestion

  • Look at scheduled jobs from yesterday, e.g. in the audit log
  • Check currently running/scheduled jobs
  • Look at current queued and cancelled jobs

Related issues 1 (0 open, 1 closed)

Copied to openQA Project - action #102440: openqa-review pipeline failed with assert self.issue_type == "bugzilla" | Resolved | mkittler | 2021-07-28 | 2021-11-30

Actions #1

Updated by livdywan over 2 years ago

  • Subject changed from Job age alert mediam followed by max to Job age alert median followed by max
Actions #2

Updated by livdywan over 2 years ago

  • Copied to action #102440: openqa-review pipeline failed with assert self.issue_type == "bugzilla" added
Actions #3

Updated by okurz over 2 years ago

  • Priority changed from Normal to Urgent
Actions #4

Updated by livdywan over 2 years ago

  • Subject changed from Job age alert median followed by max to Job age alert median followed by max size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by mkittler over 2 years ago

  • Assignee set to mkittler
Actions #6

Updated by mkittler over 2 years ago

The feature for auto-cancelling scheduled jobs was deployed before the alert fired, but so far it has only cancelled a few jobs (only one since the alert fired):

openqa=> select id, state, result, reason, t_created, t_finished from jobs where reason like '%scheduled for more than%' order by t_created desc limit 100;                                                                                                                                     
   id    |   state   |  result   |             reason             |      t_created      |     t_finished     
---------+-----------+-----------+--------------------------------+---------------------+--------------------
 7630506 | cancelled | obsoleted | scheduled for more than 7 days | 2021-11-08 11:33:22 | 2021-11-16 11:34:46
 7605743 | cancelled | obsoleted | scheduled for more than 7 days | 2021-11-04 16:58:09 | 2021-11-12 16:58:23
 7599705 | cancelled | obsoleted | scheduled for more than 7 days | 2021-11-03 17:02:51 | 2021-11-11 17:03:18
 7599703 | cancelled | obsoleted | scheduled for more than 7 days | 2021-11-03 17:02:50 | 2021-11-11 17:03:18
 7599698 | cancelled | obsoleted | scheduled for more than 7 days | 2021-11-03 17:02:46 | 2021-11-11 17:03:17
 7599694 | cancelled | obsoleted | scheduled for more than 7 days | 2021-11-03 17:02:44 | 2021-11-11 17:03:17
 7599691 | cancelled | obsoleted | scheduled for more than 7 days | 2021-11-03 17:02:41 | 2021-11-11 17:03:17
 7599669 | cancelled | obsoleted | scheduled for more than 7 days | 2021-11-03 17:01:31 | 2021-11-11 17:01:47
 7599667 | cancelled | obsoleted | scheduled for more than 7 days | 2021-11-03 17:01:30 | 2021-11-11 17:01:47
(9 rows)

So the auto-cancelling didn't have much impact here.


The auto-cancelling only affects jobs older than 7 days, while the alert already fires for jobs older than 3 days. This explains the low impact of the feature. Maybe we want to align the two thresholds?
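To illustrate why cancelling only jobs older than 7 days barely moves a median that alerts at 3 days, here is a minimal sketch. The job ages are made-up sample values, and the helper name `median_after_auto_cancel` is hypothetical; only the two thresholds (7 days and 3 days) come from this ticket.

```python
from statistics import median

ALERT_THRESHOLD_DAYS = 3  # median threshold mentioned in this ticket
AUTO_CANCEL_DAYS = 7      # auto-cancel limit mentioned in this ticket


def median_after_auto_cancel(ages_days, cancel_after_days=AUTO_CANCEL_DAYS):
    """Return the median age of the jobs that survive auto-cancelling."""
    remaining = [age for age in ages_days if age <= cancel_after_days]
    return median(remaining) if remaining else 0.0


# Hypothetical ages (in days) of currently scheduled jobs; not real data.
ages = [0.5, 1, 4, 4.5, 4.5, 5, 5, 6, 8]

before = median(ages)
after = median_after_auto_cancel(ages)
still_alerting = after > ALERT_THRESHOLD_DAYS

print(f"median before: {before} days, after: {after} days, alerting: {still_alerting}")
```

In this sample only one job is old enough to be cancelled, so the median of the remaining jobs stays above the alert threshold.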

Actions #7

Updated by mkittler over 2 years ago

  • Status changed from Workable to Feedback

It looks like most of these jobs had a special worker class (to investigate #101030):

openqa=> select count(id), (select value from job_settings where job_id = jobs.id and key = 'WORKER_CLASS' limit 1) as worker_class from jobs where t_created < '2021-11-10' and t_finished > '2021-11-14' group by worker_class;
 count |          worker_class           
-------+---------------------------------
   109 | qemu_aarch64_unstable_poo101030
     1 | s390x-kvm-sle15
(2 rows)

(Changing the condition for t_created "+-" a day doesn't really affect the outcome of the query.)

So it does not look like a general problem. I'm not sure whether we can improve the alert to avoid firing in those cases. Maybe we can just resolve the ticket now that we have gathered these findings?
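For anyone who wants to try the grouping query without access to the production database, the following is a rough sketch that runs the same query shape against an in-memory SQLite database. The table layout is reduced to the columns the query uses, the rows are invented toy data (only the two worker class names from the output above are real), and the added `order by` is just there to make the output deterministic.

```python
import sqlite3


def worker_class_counts(conn, created_before, finished_after):
    """Count long-waiting jobs per WORKER_CLASS, mirroring the query above."""
    query = """
        select count(id),
               (select value from job_settings
                 where job_id = jobs.id and key = 'WORKER_CLASS' limit 1) as worker_class
          from jobs
         where t_created < ? and t_finished > ?
         group by worker_class
         order by count(id) desc
    """
    return conn.execute(query, (created_before, finished_after)).fetchall()


conn = sqlite3.connect(":memory:")
# Reduced, hypothetical schema and toy rows; not the real openQA tables.
conn.executescript("""
    create table jobs (id integer primary key, t_created text, t_finished text);
    create table job_settings (job_id integer, key text, value text);
    insert into jobs values
        (1, '2021-11-03 17:00:00', '2021-11-15 09:00:00'),
        (2, '2021-11-03 17:00:00', '2021-11-15 09:00:00'),
        (3, '2021-11-03 17:00:00', '2021-11-15 09:00:00'),
        (4, '2021-11-05 10:00:00', '2021-11-14 12:00:00'),
        (5, '2021-11-09 08:00:00', '2021-11-09 09:00:00');
    insert into job_settings values
        (1, 'WORKER_CLASS', 'qemu_aarch64_unstable_poo101030'),
        (2, 'WORKER_CLASS', 'qemu_aarch64_unstable_poo101030'),
        (3, 'WORKER_CLASS', 'qemu_aarch64_unstable_poo101030'),
        (4, 'WORKER_CLASS', 's390x-kvm-sle15'),
        (5, 'WORKER_CLASS', 'qemu_x86_64');
""")

rows = worker_class_counts(conn, "2021-11-10", "2021-11-14")
print(rows)
```

Job 5 finished quickly and is therefore filtered out, which is the same effect that leaves only the special worker class dominating the real result.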

Actions #8

Updated by tinita over 2 years ago

The alert is not reacting to a fixed number of days, but a certain percentage, so it depends a lot on how many jobs are scheduled in total, and how many are "old".
So the auto-cancelling doesn't really match what the alert is complaining about.
Also no idea how to improve the alert (I still don't fully understand the Grafana data although Oli already explained it to me :)

I would close this.

Actions #9

Updated by mkittler over 2 years ago

I know that the alert only checks whether the median age exceeds 3 days (or 4 days in case of the "max" alert). So it is indeed different from the auto-cancelling feature, which looks at individual jobs. I just wanted to bring up for discussion whether we might want to adjust such alerts now that we have the auto-cancelling feature, but that's likely out of scope here.

I also don't know how to improve the alert. Making it smart enough to figure out that jobs like these are special and that it is ok if they are processed more slowly than usual is likely quite complicated.
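As a quick sanity check of the values from the description against these thresholds, here is a small sketch. It assumes the alert values are in seconds, which the ticket does not state explicitly; the function name `exceeds_threshold` is made up for illustration.

```python
SECONDS_PER_DAY = 86400


def exceeds_threshold(value_seconds, threshold_days):
    """Return True if an age given in seconds is above a threshold in days."""
    return value_seconds / SECONDS_PER_DAY > threshold_days


# Values from the ticket description; the unit (seconds) is an assumption.
median_alert_value = 392527.5
max_alert_value = 402487.5

median_firing = exceeds_threshold(median_alert_value, 3)  # "median" alert: 3 days
max_firing = exceeds_threshold(max_alert_value, 4)        # "max" alert: 4 days

print(f"median: {median_alert_value / SECONDS_PER_DAY:.2f} days, firing: {median_firing}")
print(f"max: {max_alert_value / SECONDS_PER_DAY:.2f} days, firing: {max_firing}")
```

Under that assumption both values are between 4 and 5 days, so both alerts would be expected to fire, which matches the observation.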

Actions #10

Updated by okurz over 2 years ago

  • Status changed from Feedback to Resolved

Yes, that's fine
