action #98562

Cancel jobs with invalid WORKER_CLASS after a timeout

Added by cdywan 3 months ago. Updated about 2 months ago.

Status: New
Priority: Low
Assignee: -
Category: Feature requests
Target version:
Start date: 2021-09-13
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Motivation

tinita was investigating job age alerts and found a job with a WORKER_CLASS that doesn't match any workers. This was traced back to asmorodskyi, who then identified the responsible change: an incorrect use of +WORKER_CLASS (+WORKER_CLASS is combined rather than overridden).

Regardless of what caused this, instead of a developer monitoring jobs and figuring out what happened, we should have openQA cancel unmatched jobs.

Acceptance criteria

  • AC1: Cancel unmatched jobs after a timeout
  • AC2: File a ticket

Workaround

Have a person monitor alerts and investigate jobs that never run, cancel the job and file a new ticket.


Related issues

Copied to openQA Project - action #100973: Cancel any scheduled jobs after a configurable timeout, e.g. days size:M (Resolved, 2021-09-13)

Copied to openQA Project - coordination #102864: [epic] Inform openQA webUI users about potential worker class mismatch or long delays (New, 2021-09-13)

History

#1 Updated by tinita 3 months ago

I would like to note several things.

tinita was investigating job age alerts

No, I wasn't, we didn't have an alert today. We had a Job Age alert several weeks/months ago, and it was caused by a job with a WORKER_CLASS which didn't have a matching worker.

To avoid alerts, I started looking into the Grafana board regularly to cancel such jobs before they grow too old and maybe hide real problems in Grafana.

and found a job with a WORKER_CLASS that doesn't match any workers.

The mentioned WORKER_CLASS qemu_x86_64,pc_azure was used in several jobs since I started to monitor Grafana, and I cancelled them all and mentioned asmorodskyi in our testing channel.

This was traced back to asmorodskyi who then identified the change which was incorrect use of +WORKER_CLASS (+WORKER_CLASS is combined rather than overridden).

Today I saw such a job again, and asmorodskyi sent us a link to a Gitlab commit. Both the job and the Gitlab MR are 4 days old. The time of the job is not clear yet because of unknown timezone.

So, through the last weeks I saw several jobs with WORKER_CLASS=qemu_x86_64,pc_azure. How can a commit from 4 days ago be responsible for that?

#2 Updated by okurz 3 months ago

  • Priority changed from Normal to Low
  • Target version set to Ready

I think the feature request is a great idea. As an exception I am adding it immediately to the backlog (though with low prio) because I think it should help us with endlessly stuck jobs, preventing further alerts and giving the user a good hint instead.

#3 Updated by osukup about 2 months ago

How do we differentiate between a non-existent WORKER_CLASS and jobs with low prio when the queue is full for a long time?

#4 Updated by cdywan about 2 months ago

osukup wrote:

How do we differentiate between a non-existent WORKER_CLASS and jobs with low prio when the queue is full for a long time?

I think technically you don't need to check it. And the AC just says Cancel unmatched jobs after a timeout. If the job is not getting picked up after n seconds, cancel it.

#5 Updated by osukup about 2 months ago

I think technically you don't need to check it. And the AC just says Cancel unmatched jobs after a timeout. If the job is not getting picked up after n seconds, cancel it.

Sounds right,

but the timeout for something with prio 30 should be different than for a job with prio 160.

Something like the formula (prio/100) * BASE_TIMEOUT, where BASE_TIMEOUT is about 6 days?
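To make the proposed formula concrete, here is a minimal Python sketch. It only illustrates the arithmetic of (prio/100) * BASE_TIMEOUT; the function name `cancel_timeout` is hypothetical and nothing here is existing openQA code. It assumes, in line with the numbers used in this thread, that a larger prio value means a lower priority and therefore a longer allowed wait.

```python
from datetime import timedelta

# Assumption: "about 6 days", as suggested in the comment above.
BASE_TIMEOUT = timedelta(days=6)


def cancel_timeout(prio: int, base: timedelta = BASE_TIMEOUT) -> timedelta:
    """Return how long a scheduled job with the given prio may wait before
    it would be cancelled, using (prio / 100) * base."""
    if prio <= 0:
        raise ValueError("prio must be a positive integer")
    return base * (prio / 100)


if __name__ == "__main__":
    for prio in (30, 100, 160):
        days = cancel_timeout(prio).total_seconds() / 86400
        print(f"prio {prio}: cancel after {days:.1f} days")
```

With these assumptions a prio 30 job would be cancelled after 1.8 days and a prio 160 job after 9.6 days.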

#6 Updated by okurz about 2 months ago

For a start I would make it really simple and define a simple hard criterion: The time, nothing else. Based on how that suits us we can eventually think about incorporating priority or other factors.
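As a rough illustration of the time-only criterion, a sketch in Python could look like the following. The names `MAX_SCHEDULED_AGE`, `is_overdue` and `jobs_to_cancel` are hypothetical, the 7-day limit is a placeholder rather than a value agreed in this ticket, and the job data is a plain in-memory mapping instead of the openQA database.

```python
from datetime import datetime, timedelta, timezone

# Assumption: one priority-independent limit; the value is illustrative only.
MAX_SCHEDULED_AGE = timedelta(days=7)


def is_overdue(scheduled_at: datetime, now: datetime,
               max_age: timedelta = MAX_SCHEDULED_AGE) -> bool:
    """True if the job has been waiting in the queue longer than max_age."""
    return now - scheduled_at > max_age


def jobs_to_cancel(jobs: dict, now: datetime,
                   max_age: timedelta = MAX_SCHEDULED_AGE) -> list:
    """Return the ids of scheduled jobs that exceeded max_age.

    `jobs` maps a job id to the timezone-aware time the job was scheduled.
    """
    return sorted(job_id for job_id, scheduled_at in jobs.items()
                  if is_overdue(scheduled_at, now, max_age))


if __name__ == "__main__":
    now = datetime(2021, 11, 1, tzinfo=timezone.utc)
    jobs = {1: now - timedelta(days=8), 2: now - timedelta(days=1)}
    print(jobs_to_cancel(jobs, now))
```

Using timezone-aware timestamps throughout avoids the kind of ambiguity about job times mentioned earlier in this thread.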

#7 Updated by nicksinger about 2 months ago

I faced a similar issue in my alert duty today again. I really think this feature would be useful.

Additionally what comes to my mind: why can't we catch an invalid WORKER_CLASS even before the job gets posted/cloned? Also introduce some kind of override to catch special edge-cases. I think it could catch 99% of the mistakes before they end up on openQA.

#8 Updated by okurz about 2 months ago

nicksinger wrote:

Additionally what comes to my mind: why can't we catch an invalid WORKER_CLASS even before the job gets posted/cloned? Also introduce some kind of override to catch special edge-cases. I think it could catch 99% of the mistakes before they end up on openQA.

Let's assume a user posts a job for the worker class "vmware". Maybe the one and only worker being able to work on that is down at that time. Should we declare the worker class as "invalid"? Or should we check each scheduled job against our salt-pillars document? I did not follow that route because I would not know how to specifically detect what exactly would be the "configuration error"

#9 Updated by osukup about 2 months ago

define invalid WORKER_CLASS :D

The scheduler uses the WORKER_CLASS of free workers to assign jobs to the proper worker. So a missing worker with the corresponding worker class means in most cases simply that all workers valid for the job are busy now.
Usually, jobs with prio about 50 get scheduled within roughly 1 h, in the worst cases in under 24 h; jobs with low prio like 150 take 3 to 4 days.

So a safe time limit for cancelling a job with the reason "bad WORKER_CLASS" is about 8 days for low prio jobs.

#10 Updated by asmorodskyi about 2 months ago

okurz wrote:

nicksinger wrote:

Additionally what comes to my mind: why can't we catch an invalid WORKER_CLASS even before the job gets posted/cloned? Also introduce some kind of override to catch special edge-cases. I think it could catch 99% of the mistakes before they end up on openQA.

Let's assume a user posts a job for the worker class "vmware". Maybe the one and only worker being able to work on that is down at that time. Should we declare the worker class as "invalid"? Or should we check each scheduled job against our salt-pillars document? I did not follow that route because I would not know how to specifically detect what exactly would be the "configuration error"

I would say yes: if all workers of a certain class are down, all jobs for this worker class should be treated as invalid. To continue with your example, let's imagine that the single VMware server which hosts all instances of "vmware" workers lost its HDD; a new one was obviously ordered and will arrive in 2 weeks, plus several days to set everything up. So the question back to you: what would you prefer to do with all the jobs which will be created during this time?

#11 Updated by asmorodskyi about 2 months ago

Reading all the comments, I must say that you are trying to solve some problems which are not directly related to the one which was the root cause for creating this ticket. Tina contacted me several times because we had jobs with WORKER_CLASS=pc_azure,x86_64. All workers which we have for PubCloud are dedicated to PubCloud only, so there are 0 workers in any state with such a class. And this is something which can be easily detected by querying openQA: we just need a query which returns all workers in any state which have such a class combination, and if the list is empty then there is no doubt that the job should be terminated.
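The check described here could be sketched as follows in Python. This is not an openQA query; it works on plain strings and assumes that WORKER_CLASS is a comma-separated list and that a worker matches a job only if it provides every class the job asks for. The helper names and the example worker list are hypothetical.

```python
from __future__ import annotations


def parse_worker_classes(value: str) -> frozenset[str]:
    """Split a comma-separated WORKER_CLASS value into a set of class names."""
    return frozenset(part.strip() for part in value.split(",") if part.strip())


def has_matching_worker(job_worker_class: str, worker_classes: list[str]) -> bool:
    """True if at least one worker (in any state) provides all classes of the job.

    `worker_classes` holds one WORKER_CLASS string per known worker.
    """
    required = parse_worker_classes(job_worker_class)
    return any(required <= parse_worker_classes(w) for w in worker_classes)


if __name__ == "__main__":
    # Hypothetical worker pool: PubCloud workers are dedicated to PubCloud only.
    workers = ["qemu_x86_64,tap", "pc_azure,pc_ec2"]
    for job_class in ("qemu_x86_64", "qemu_x86_64,pc_azure"):
        if has_matching_worker(job_class, workers):
            print(f"{job_class}: some worker could pick this up")
        else:
            print(f"{job_class}: no worker has this combination, cancel the job")
```

Because the check looks at all known workers regardless of state, a class whose only worker is temporarily down would still count as matched; only combinations that no worker provides at all are flagged.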

#12 Updated by okurz about 2 months ago

  • Copied to action #100973: Cancel any scheduled jobs after a configurable timeout, e.g. days size:M added

#13 Updated by okurz about 2 months ago

  • Target version changed from Ready to future

Let's start with the easier solution, #100973, first.

#14 Updated by okurz 5 days ago

  • Copied to coordination #102864: [epic] Inform openQA webUI users about potential worker class mismatch or long delays added
