action #98562: Cancel jobs with invalid WORKER_CLASS after a timeout - openQA Project - openSUSE Project Management Tool

action #98562

open

Cancel jobs with invalid WORKER_CLASS after a timeout

Added by livdywan about 3 years ago. Updated about 3 years ago.

Status:

New

Priority:

Low

Assignee:

Category:

Feature requests

Target version:

QA - future

Start date:

2021-09-13

Due date:

% Done:

Estimated time:

Description

Motivation¶

@tinita was investigating job age alerts and found a job with a WORKER_CLASS that doesn't match any workers. This was traced back to @asmorodskyi, who then identified the responsible change: an incorrect use of +WORKER_CLASS (+WORKER_CLASS is combined rather than overridden).

Regardless of what caused this, instead of a developer monitoring jobs and figuring out what happened, we should have openQA cancel unmatched jobs.

Acceptance criteria¶

AC1: Cancel unmatched jobs after a timeout
AC2: File a ticket

Workaround¶

Have a person monitor alerts and investigate jobs that never run, cancel the job and file a new ticket.

Related issues 2 (1 open — 1 closed)

Copied to openQA Project - action #100973: Cancel any scheduled jobs after a configurable timeout, e.g. days size:M

Resolved

osukup

2021-09-13

Copied to openQA Project - coordination #102864: [epic] Inform openQA webUI users about potential worker class mismatch or long delays

New

2021-09-13

History

Updated by tinita about 3 years ago

I would like to note several things.

tinita was investigating job age alerts

No, I wasn't, we didn't have an alert today. We had a Job Age alert several weeks/months ago, and it was caused by a job with a WORKER_CLASS which didn't have a matching worker.

To avoid alerts, I started looking into the Grafana board regularly to cancel such jobs before they grow too old and maybe hide real problems in Grafana.

and found a job with a WORKER_CLASS that doesn't match any workers.

The mentioned WORKER_CLASS qemu_x86_64,pc_azure was used in several jobs since I started to monitor Grafana, and I cancelled them all and mentioned asmorodskyi in our testing channel.

This was traced back to asmorodskyi who then identified the change which was incorrect use of +WORKER_CLASS (+WORKER_CLASS is combined rather than overridden).

Today I saw such a job again, and asmorodskyi sent us a link to a Gitlab commit. Both the job and the Gitlab MR are 4 days old. The time of the job is not clear yet because of unknown timezone.

So, through the last weeks I saw several jobs with WORKER_CLASS=qemu_x86_64,pc_azure. How can a commit from 4 days ago be responsible for that?

Updated by okurz about 3 years ago

Priority changed from Normal to Low
Target version set to Ready

I think the feature request is a great idea. As an exception I am adding it immediately to the backlog (though with low prio) because I think it should help us with endlessly stuck jobs, preventing further alerts and giving the user a good hint instead.

Updated by osukup about 3 years ago

How do we differentiate between a non-existent WORKER_CLASS and low-prio jobs when the queue is full for a long time?

Updated by livdywan about 3 years ago

osukup wrote:

How do we differentiate between a non-existent WORKER_CLASS and low-prio jobs when the queue is full for a long time?

I think technically you don't need to check it. And the AC just says Cancel unmatched jobs after a timeout. If the job is not getting picked up after n seconds, cancel it.

Updated by osukup about 3 years ago

I think technically you don't need to check it. And the AC just says Cancel unmatched jobs after a timeout. If the job is not getting picked up after n seconds, cancel it.

Sounds right,

but the timeout for a job with prio 30 should be different from the one for a job with prio 160.

Something like the formula (prio/100) * BASE_TIMEOUT, where BASE_TIMEOUT is about 6 days?
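
To make the proposed formula concrete, here is a minimal Python sketch of it. The function name `cancel_timeout` is hypothetical, and the 6-day value is only the rough figure suggested above, not an agreed setting; nothing here is openQA code.

```python
from datetime import timedelta

# Assumption: "about 6 days" from the suggestion above, used as a placeholder.
BASE_TIMEOUT = timedelta(days=6)


def cancel_timeout(prio: int, base: timedelta = BASE_TIMEOUT) -> timedelta:
    """Return (prio / 100) * base, i.e. a timeout that grows with the prio value."""
    if prio < 0:
        raise ValueError("prio must be non-negative")
    # Multiply first, then divide, so the arithmetic stays exact on timedelta.
    return base * prio / 100
```

With these numbers a prio 30 job would be cancelled after 1.8 days and a prio 160 job after 9.6 days, so jobs that are expected to wait longer in a full queue also get more time before they are given up on.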

Updated by okurz about 3 years ago

For a start I would make it really simple and define a simple hard criterion: The time, nothing else. Based on how that suits us we can eventually think about incorporating priority or other factors.
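
As an illustration of such a time-only criterion, the following Python sketch selects scheduled jobs that are older than a fixed limit. The `Job` class, its fields and the `jobs_to_cancel` helper are hypothetical stand-ins for illustration, not openQA's data model or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional


@dataclass
class Job:
    """Hypothetical stand-in for a job record; not openQA's schema."""
    id: int
    state: str
    created: datetime


def jobs_to_cancel(jobs: Iterable[Job], max_age: timedelta,
                   now: Optional[datetime] = None) -> List[Job]:
    """Return jobs that are still scheduled and older than max_age."""
    if now is None:
        now = datetime.now(timezone.utc)
    return [j for j in jobs
            if j.state == "scheduled" and now - j.created > max_age]
```

The check deliberately ignores priority and worker class: the only inputs are the job state and its age, which matches the "time, nothing else" starting point.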

Updated by nicksinger about 3 years ago

I faced a similar issue in my alert duty today again. I really think this feature would be useful.

Additionally what comes to my mind: why can't we catch an invalid WORKER_CLASS even before the job gets posted/cloned? Also introduce some kind of override to catch special edge-cases. I think it could catch 99% of the mistakes before they end up on openQA.

Updated by okurz about 3 years ago

nicksinger wrote:

Additionally what comes to my mind: why can't we catch an invalid WORKER_CLASS even before the job gets posted/cloned? Also introduce some kind of override to catch special edge-cases. I think it could catch 99% of the mistakes before they end up on openQA.

Let's assume a user posts a job for the worker class "vmware". Maybe the one and only worker being able to work on that is down at that time. Should we declare the worker class as "invalid"? Or should we check each scheduled job against our salt-pillars document? I did not follow that route because I would not know how to specifically detect what exactly would be the "configuration error"

Updated by osukup about 3 years ago

Define invalid WORKER_CLASS :D

The scheduler uses the WORKER_CLASS of free workers to assign jobs to the proper worker. So a missing worker with the corresponding worker class in most cases simply means that all workers valid for the job are busy right now.
Usually, jobs with prio about 50 get scheduled within roughly 1 hour, in the worst case within 24 hours; jobs with low prio like 150 take 3 to 4 days.

So a safe time limit for cancelling a job with the reason "bad WORKER_CLASS" is about 8 days for low-prio jobs.

#10

Updated by asmorodskyi about 3 years ago

okurz wrote:

nicksinger wrote:

Additionally what comes to my mind: why can't we catch an invalid WORKER_CLASS even before the job gets posted/cloned? Also introduce some kind of override to catch special edge-cases. I think it could catch 99% of the mistakes before they end up on openQA.

Let's assume a user posts a job for the worker class "vmware". Maybe the one and only worker being able to work on that is down at that time. Should we declare the worker class as "invalid"? Or should we check each scheduled job against our salt-pillars document? I did not follow that route because I would not know how to specifically detect what exactly would be the "configuration error"

I would say yes: if all workers of a certain class are down, all jobs for this worker class should be treated as invalid. To continue with your example, let's imagine that the single VMware server which hosts all instances of "vmware" workers lost its HDD; a new one was obviously ordered, and it will arrive in 2 weeks, plus several days to set everything up. So the question back to you: what would you prefer to do with all the jobs which will be created during this time?

#11

Updated by asmorodskyi about 3 years ago

Reading all the comments, I must say that you are trying to solve some problems which are not directly related to the one which was the root cause for creating this ticket. Tina contacted me several times because we had jobs with WORKER_CLASS=pc_azure,x86_64. All workers which we have for PubCloud are dedicated to PubCloud only, so there are 0 workers in any state with such a class. This is something which can be easily detected by querying openQA: we just need a query which returns all workers in any state which have such a class combination, and if the list is empty, then without doubt the job should be terminated.
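
The check described here could look roughly like the following Python sketch. It works on plain strings instead of a real openQA query, and it assumes that a worker matches a job when the worker advertises every class the job asks for; both function names and the worker lists in the usage note are hypothetical.

```python
from typing import FrozenSet, Iterable


def parse_classes(value: str) -> FrozenSet[str]:
    """Split a comma-separated WORKER_CLASS value into a set of class names."""
    return frozenset(part.strip() for part in value.split(",") if part.strip())


def has_matching_worker(job_worker_class: str,
                        worker_classes: Iterable[str]) -> bool:
    """True if any worker (in any state) offers all classes the job requires.

    Assumption: matching means the job's classes are a subset of the
    worker's classes.
    """
    required = parse_classes(job_worker_class)
    return any(required <= parse_classes(w) for w in worker_classes)
```

For example, with made-up workers `["qemu_x86_64,tap", "pc_azure,pc_ec2"]`, a job asking for `pc_azure,x86_64` has no matching worker at all and would be a candidate for cancellation, while a job asking for `pc_azure` still has one and should keep waiting.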

#12

Updated by okurz about 3 years ago

Copied to action #100973: Cancel any scheduled jobs after a configurable timeout, e.g. days size:M added

#13

Updated by okurz about 3 years ago

Target version changed from Ready to future

Let's start with the easier solution #100973 first.

#14

Updated by okurz almost 3 years ago

Copied to coordination #102864: [epic] Inform openQA webUI users about potential worker class mismatch or long delays added
