action #178204: Reduce test start time on openqa.suse.de size:S - openQA Infrastructure (public) - openSUSE Project Management Tool

action #178204

open

openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project (public) - coordination #178243: [epic] More efficient handling of big job schedules, not executable jobs, never matching worker classes, etc.

Reduce test start time on openqa.suse.de size:S

Added by gpuliti about 21 hours ago. Updated 8 minutes ago.

Status: In Progress

Priority: Normal

Assignee: okurz

Category: Regressions/Crashes

Target version: openQA Project (public) - Ready

Start date:

Due date:

% Done:

Estimated time:

Tags: openQA, osd, tests, administration, infra

Description

Observation¶

https://monitor.qa.suse.de/d/7W06NBWGk/job-age?orgId=1&from=2025-03-03T02:45:29.209Z&to=2025-03-03T06:58:26.736Z&timezone=UTC

Relevant panel: https://monitor.qa.suse.de/d/7W06NBWGk/job-age?viewPanel=panel-5&orgId=1&from=2025-03-01T19%3A35%3A43.674Z&to=2025-03-04T06%3A19%3A31.656Z&timezone=utc

Based on observations there are recurring alerts indicating long wait times before execution.

gpuliti preferred not to silence the alert since it is not that common yet, at least in the last week, but we should try to optimize test scheduling to reduce waiting times.

The main offenders seem to be jobs scheduled by "QE Security" with a worker class config that can never be picked up, as there are no workers for "qemu_x86_64,intel,tap".

Acceptance Criteria¶

AC1: There is a shared understanding whether to remove or change the alert, or to handle it with another workflow

Suggestions¶

Are there any bottlenecks? Answer: No, there aren't. We need to discuss expectations.
Also see similar stories from the past: #73174
Report new feature requests to detect jobs that cannot be picked up by any current worker because no worker matches their worker class, and block on that. After that we can cancel such jobs earlier and still keep a sensible alert for jobs that would match current workers but are just delayed for a long time.
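
To illustrate the last suggestion, here is a minimal sketch of how such a check could look. It is not openQA code: the helper names, the sample workers and the sample jobs are all hypothetical, and it assumes that a job can only be picked up by a worker that offers every class listed in the job's comma-separated worker class setting.

```python
from statistics import median


def parse_worker_class(value: str) -> frozenset[str]:
    """Split a comma-separated worker class string into a set of class names."""
    return frozenset(part.strip() for part in value.split(",") if part.strip())


def is_matchable(job_classes: frozenset[str], workers: list[frozenset[str]]) -> bool:
    """Assumption: a job matches a worker only if the worker offers all requested classes."""
    return any(job_classes <= worker_classes for worker_classes in workers)


# Hypothetical sample data, not taken from openqa.suse.de.
workers = [
    parse_worker_class("qemu_x86_64,tap"),
    parse_worker_class("qemu_x86_64,intel"),
]
scheduled_jobs = [
    {"id": 1, "worker_class": "qemu_x86_64,intel,tap", "age_s": 90000},
    {"id": 2, "worker_class": "qemu_x86_64,tap", "age_s": 600},
    {"id": 3, "worker_class": "qemu_x86_64", "age_s": 300},
]

# Split the schedule into jobs that some worker could run and jobs no worker can run.
matchable, unmatchable = [], []
for job in scheduled_jobs:
    classes = parse_worker_class(job["worker_class"])
    (matchable if is_matchable(classes, workers) else unmatchable).append(job)

# Jobs in "unmatchable" are candidates for early cancellation.
unmatchable_ids = [job["id"] for job in unmatchable]

# The age alert would then only consider jobs that a current worker could pick up.
median_age_matchable = median(job["age_s"] for job in matchable)
median_age_all = median(job["age_s"] for job in scheduled_jobs)
```

In this made-up schedule only job 1 is unmatchable, because no single worker offers "qemu_x86_64", "intel" and "tap" together. Leaving it out of the median keeps the alert meaningful for jobs that are merely delayed.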

Rollback actions¶

Remove the silence for alertname=Job age (scheduled) (median) alert from https://monitor.qa.suse.de/alerting/silences?alertmanager=grafana

Related issues 1 (0 open — 1 closed)

Updated by gpuliti about 21 hours ago

Updated by gpuliti about 20 hours ago

Updated by okurz about 19 hours ago

Updated by mkittler about 6 hours ago

Updated by mkittler about 5 hours ago

Updated by okurz about 4 hours ago

Updated by okurz about 4 hours ago

Updated by okurz about 4 hours ago

Updated by okurz about 4 hours ago

Updated by okurz about 3 hours ago

Updated by tinita 8 minutes ago