action #98562
open
Cancel jobs with invalid WORKER_CLASS after a timeout
0%
Description
Motivation
@tinita was investigating job age alerts and found a job with a WORKER_CLASS that doesn't match any workers. This was traced back to @asmorodskyi, who then identified the responsible change: an incorrect use of +WORKER_CLASS (+WORKER_CLASS is combined rather than overridden).
Regardless of what caused this, instead of a developer monitoring jobs and figuring out what happened, we should have openQA cancel unmatched jobs.
Acceptance criteria
- AC1: Cancel unmatched jobs after a timeout
- AC2: File a ticket
Workaround
Have a person monitor alerts and investigate jobs that never run, then cancel the job and file a new ticket.
Updated by tinita about 3 years ago
I would like to note several things.
tinita was investigating job age alerts
No, I wasn't, we didn't have an alert today. We had a Job Age alert several weeks/months ago, and it was caused by a job with a WORKER_CLASS which didn't have a matching worker.
To avoid alerts, I started looking into the Grafana board regularly to cancel such jobs before they grow too old and maybe hide real problems in Grafana.
and found a job with a WORKER_CLASS that doesn't match any workers.
The mentioned WORKER_CLASS qemu_x86_64,pc_azure was used in several jobs since I started to monitor Grafana, and I cancelled them all and mentioned asmorodskyi in our testing channel.
This was traced back to asmorodskyi, who then identified the responsible change: an incorrect use of +WORKER_CLASS (+WORKER_CLASS is combined rather than overridden).
Today I saw such a job again, and asmorodskyi sent us a link to a GitLab commit. Both the job and the GitLab MR are 4 days old. The exact time of the job is not clear yet because the timezone is unknown.
So, over the last weeks I saw several jobs with WORKER_CLASS=qemu_x86_64,pc_azure. How can a commit from 4 days ago be responsible for that?
Updated by okurz about 3 years ago
- Priority changed from Normal to Low
- Target version set to Ready
I think the feature request is a great idea. As an exception I am adding it immediately to the backlog (though with low prio) because I think it should help us with endlessly stuck jobs, preventing further alerts and giving the user a good hint instead.
Updated by osukup about 3 years ago
How do we differentiate between a non-existent WORKER_CLASS and jobs with low prio when the queue is full for a long time?
Updated by livdywan about 3 years ago
osukup wrote:
How do we differentiate between a non-existent WORKER_CLASS and jobs with low prio when the queue is full for a long time?
I think technically you don't need to check it. And the AC just says "Cancel unmatched jobs after a timeout". If the job is not getting picked up after n seconds, cancel it.
Updated by osukup about 3 years ago
I think technically you don't need to check it. And the AC just says "Cancel unmatched jobs after a timeout". If the job is not getting picked up after n seconds, cancel it.
Sounds right, but a job with prio 30 should have a different timeout than a job with prio 160.
Something like the formula (prio/100) * BASE_TIMEOUT, where BASE_TIMEOUT is about 6 days?
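A minimal sketch of the proposed formula, to make the resulting numbers concrete. The function name `cancel_timeout` is hypothetical and not part of openQA; the 6-day base is the "about 6 days" guess from the comment above, not a configured value.

```python
from datetime import timedelta

# Assumption: the "about 6 days" suggested in the comment, not an openQA setting.
BASE_TIMEOUT = timedelta(days=6)


def cancel_timeout(prio: int, base: timedelta = BASE_TIMEOUT) -> timedelta:
    """Return (prio / 100) * base, i.e. the timeout scales linearly with prio.

    A job with a numerically low prio value gets a shorter timeout,
    a job with a numerically high prio value gets a longer one.
    """
    if prio < 0:
        raise ValueError("prio must not be negative")
    return base * (prio / 100)


if __name__ == "__main__":
    for prio in (30, 100, 160):
        print(prio, cancel_timeout(prio))
```

With these assumptions a job with prio 30 would be cancelled after 1.8 days and a job with prio 160 after 9.6 days.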
Updated by okurz about 3 years ago
For a start I would make it really simple and define a simple hard criterion: The time, nothing else. Based on how that suits us we can eventually think about incorporating priority or other factors.
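The time-only criterion could be sketched roughly as below. This is an illustration, not openQA code: the names `is_stuck`, `jobs_to_cancel` and `MAX_SCHEDULED_AGE` are hypothetical, the 7-day default is an arbitrary placeholder, and jobs are modelled as a plain mapping from job id to the time the job was scheduled.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical default; the thread does not settle on a concrete value.
MAX_SCHEDULED_AGE = timedelta(days=7)


def is_stuck(scheduled_at: datetime, now: datetime,
             max_age: timedelta = MAX_SCHEDULED_AGE) -> bool:
    """True if the job has been waiting longer than max_age. Only time is considered."""
    return now - scheduled_at > max_age


def jobs_to_cancel(jobs: dict, now: datetime,
                   max_age: timedelta = MAX_SCHEDULED_AGE) -> list:
    """Return the ids of all scheduled jobs that exceeded max_age, in input order."""
    return [job_id for job_id, scheduled_at in jobs.items()
            if is_stuck(scheduled_at, now, max_age)]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    scheduled = {1: now - timedelta(days=10), 2: now - timedelta(hours=3)}
    print(jobs_to_cancel(scheduled, now))
```

Priority or worker class are deliberately ignored here; they could later be folded into how `max_age` is chosen without changing the check itself.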
Updated by nicksinger about 3 years ago
I faced a similar issue in my alert duty today again. I really think this feature would be useful.
Additionally what comes to my mind: why can't we catch an invalid WORKER_CLASS even before the job gets posted/cloned? Also introduce some kind of override to catch special edge-cases. I think it could catch 99% of the mistakes before they end up on openQA.
Updated by okurz about 3 years ago
nicksinger wrote:
Additionally what comes to my mind: why can't we catch an invalid WORKER_CLASS even before the job gets posted/cloned? Also introduce some kind of override to catch special edge-cases. I think it could catch 99% of the mistakes before they end up on openQA.
Let's assume a user posts a job for the worker class "vmware". Maybe the one and only worker being able to work on that is down at that time. Should we declare the worker class as "invalid"? Or should we check each scheduled job against our salt-pillars document? I did not follow that route because I would not know how to specifically detect what exactly would be the "configuration error"
Updated by osukup about 3 years ago
define invalid WORKER_CLASS :D
The scheduler uses the WORKER_CLASS of free workers to assign jobs to the proper worker. So a missing worker with the corresponding worker class in most cases simply means that all workers valid for the job are busy right now.
Usually, jobs with prio about 50 get scheduled within roughly 1 h, in the worst case in under 24 h; jobs with low prio like 150 take 3 to 4 days.
So a safe time limit for cancelling a job with the reason "bad WORKER_CLASS" is about 8 days for low prio jobs.
Updated by asmorodskyi about 3 years ago
okurz wrote:
nicksinger wrote:
Additionally what comes to my mind: why can't we catch an invalid WORKER_CLASS even before the job gets posted/cloned? Also introduce some kind of override to catch special edge-cases. I think it could catch 99% of the mistakes before they end up on openQA.
Let's assume a user posts a job for the worker class "vmware". Maybe the one and only worker being able to work on that is down at that time. Should we declare the worker class as "invalid"? Or should we check each scheduled job against our salt-pillars document? I did not follow that route because I would not know how to specifically detect what exactly would be the "configuration error"
I would say yes: if all workers of a certain class are down, all jobs for this worker class should be treated as invalid. Continuing with your example, let's imagine that the single VMware server which hosts all instances of "vmware" workers lost its HDD; a new one was obviously ordered and will arrive in 2 weeks, plus several days to set everything up. So the question back to you: what would you prefer to do with all jobs created during this time?
Updated by asmorodskyi about 3 years ago
Reading all the comments, I must say that you are trying to solve some problems which are not directly related to the one that was the root cause for creating this ticket. Tina contacted me several times because we had jobs with WORKER_CLASS=pc_azure,x86_64. All workers which we have for PubCloud are dedicated to PubCloud only, so there are 0 workers in any state with such a class. This is something which can easily be detected by querying openQA: we just need a query which returns all workers in any state that have such a class combination, and if the list is empty then there is no doubt the job should be terminated.
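The check described above could look roughly like the following sketch. It assumes that a worker can run a job only if the worker offers every class listed in the job's WORKER_CLASS, and that the worker classes have already been fetched from openQA as comma-separated strings for workers in any state. The function names are hypothetical, and how the worker list is obtained is left out on purpose.

```python
from __future__ import annotations


def parse_classes(value: str) -> frozenset[str]:
    """Split a comma-separated WORKER_CLASS value into a set of class names."""
    return frozenset(part.strip() for part in value.split(",") if part.strip())


def has_matching_worker(job_worker_class: str, worker_classes: list[str]) -> bool:
    """True if at least one worker offers all classes the job asks for.

    Assumption: matching means the job's classes are a subset of the
    classes of a single worker, regardless of that worker's state.
    """
    required = parse_classes(job_worker_class)
    return any(required <= parse_classes(offered) for offered in worker_classes)


if __name__ == "__main__":
    # Illustrative data: one generic worker and one worker dedicated to PubCloud.
    workers = ["qemu_x86_64", "pc_azure"]
    print(has_matching_worker("qemu_x86_64,pc_azure", workers))  # no single worker has both
    print(has_matching_worker("pc_azure", workers))
```

If the result is false for a scheduled job, no worker in any state could ever pick it up, which is the case this comment proposes to terminate without waiting for a timeout.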
Updated by okurz about 3 years ago
- Copied to action #100973: Cancel any scheduled jobs after a configurable timeout, e.g. days size:M added
Updated by okurz about 3 years ago
- Target version changed from Ready to future
Let's start with the easier solution #100973 first.
Updated by okurz almost 3 years ago
- Copied to coordination #102864: [epic] Inform openQA webUI users about potential worker class mismatch or long delays added