action #178204: Reduce test start time on openqa.suse.de - openQA Infrastructure (public) - openSUSE Project Management Tool


action #178204

open

openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project (public) - coordination #178243: [epic] More efficient handling of big job schedules, not executable jobs, never matching worker classes, etc.

Reduce test start time on openqa.suse.de

Added by gpuliti about 20 hours ago. Updated about 2 hours ago.

Status: In Progress
Priority: Normal
Assignee: okurz
Category: Regressions/Crashes
Target version: openQA Project (public) - Ready
Start date:
Due date:
% Done:
Estimated time:
Tags: openQA, osd, tests, administration, infra

Description

Observation¶

https://monitor.qa.suse.de/d/7W06NBWGk/job-age?orgId=1&from=2025-03-03T02:45:29.209Z&to=2025-03-03T06:58:26.736Z&timezone=UTC

Relevant panel: https://monitor.qa.suse.de/d/7W06NBWGk/job-age?viewPanel=panel-5&orgId=1&from=2025-03-01T19%3A35%3A43.674Z&to=2025-03-04T06%3A19%3A31.656Z&timezone=utc

Based on observations there are recurring alerts indicating long wait times before execution.

gpuliti preferred not to silence the alert since it has not been that common yet, at least in the last week, but we should try to optimize test scheduling to reduce waiting times.

The main offender seems to be jobs with a worker class config that can never be picked up, as there are no workers for "qemu_x86_64,intel,tap". These jobs are scheduled by "QE Security".

Suggestions¶

Are there any bottlenecks? Answer: No, there aren't. The main problem is
Also see similar stories from the past: #73174
Report new feature requests to detect jobs that cannot be picked up because no current worker matches their worker class, and block on that. After that we can cancel such jobs earlier and still keep a sensible alert for jobs that would match current workers but are just delayed for a long time.
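The last suggestion can be sketched roughly as follows. This is only an illustration, not openQA code: it assumes a job can be picked up only by a worker that provides every class in the job's comma-separated worker class, and the job records, worker class sets and the `age_hours` field are made-up sample data. It splits scheduled jobs into matchable and unmatchable ones, so that a job age alert could be computed over the matchable ones only.

```python
from statistics import median


def parse_classes(value):
    """Turn a comma-separated worker class string into a set of class names."""
    return {c.strip() for c in value.split(",") if c.strip()}


def is_matchable(job_classes, workers):
    """Assumption: a worker matches if it provides all classes the job asks for."""
    return any(job_classes <= worker_classes for worker_classes in workers)


def split_scheduled_jobs(jobs, workers):
    """Separate jobs that some worker could run from jobs no worker can run."""
    matchable, unmatchable = [], []
    for job in jobs:
        classes = parse_classes(job["worker_class"])
        (matchable if is_matchable(classes, workers) else unmatchable).append(job)
    return matchable, unmatchable


# Hypothetical sample data, not taken from openqa.suse.de.
workers = [
    parse_classes("qemu_x86_64,tap"),
    parse_classes("qemu_x86_64,intel,tap_secondary"),
]
jobs = [
    {"id": 1, "worker_class": "qemu_x86_64,tap", "age_hours": 2.0},
    {"id": 2, "worker_class": "qemu_x86_64", "age_hours": 4.0},
    {"id": 3, "worker_class": "qemu_x86_64,intel,tap", "age_hours": 96.0},
]

matchable, unmatchable = split_scheduled_jobs(jobs, workers)
# Median age over jobs that current workers could actually pick up.
median_matchable_age = median(job["age_hours"] for job in matchable)
# Candidates for early cancellation or a separate report.
unmatchable_ids = [job["id"] for job in unmatchable]
```

With this split, a single job that has been waiting for days because of a non-existing class combination would no longer influence the median used for alerting, while it would still be visible in the list of unmatchable jobs.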

Rollback actions¶

Remove silence from https://monitor.qa.suse.de/alerting/silences?alertmanager=grafana alertname=Job age (scheduled) (median) alert

Related issues 1 (0 open — 1 closed)


Updated by gpuliti about 20 hours ago

Copied from action #174235: Cover code of os-autoinst path script/os-autoinst-openvswitch fully (statement coverage) size:S added


Updated by gpuliti about 20 hours ago

Copied from deleted (action #174235: Cover code of os-autoinst path script/os-autoinst-openvswitch fully (statement coverage) size:S)


Updated by okurz about 18 hours ago

Tags set to osd, infra, administration, openqa, tests
Project changed from openQA Project (public) to openQA Infrastructure (public)
Category changed from Regressions/Crashes to Regressions/Crashes
Priority changed from Normal to Urgent

Made urgent as this is related to a recent alert that is not silenced, and no mitigation has been applied yet.


Updated by mkittler about 5 hours ago

I mentioned the problematic old jobs on #eng-testing:

There are jobs scheduled on OSD with the worker class qemu_x86_64,intel,tap. Those cannot be scheduled because the combination intel,tap doesn't exist at the moment. I suppose qesapworker-prgX workers would in theory provide that but the tap worker class is disabled there as tap_secondary. Not sure what the best solution is.

There is also an s390-kvm,tap job, which is another combination that doesn't exist.
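To illustrate why such a job stays scheduled, here is a small sketch that lists which of the requested classes each worker is missing. It assumes that a worker must provide all classes a job requests. The class lists assigned to the workers are guesses for illustration only (the real configuration of qesapworker-prgX is not shown in this ticket), and `hypothetical-tap-worker` is an invented name.

```python
def parse_classes(value):
    """Turn a comma-separated worker class string into a set of class names."""
    return {c.strip() for c in value.split(",") if c.strip()}


def missing_classes(job_class, workers):
    """Map each worker name to the classes the job requires but the worker lacks."""
    required = parse_classes(job_class)
    return {
        name: sorted(required - parse_classes(provided))
        for name, provided in workers.items()
    }


# Assumed worker configuration, for illustration only.
workers = {
    "qesapworker-prgX": "qemu_x86_64,intel,tap_secondary",
    "hypothetical-tap-worker": "qemu_x86_64,tap",
}

gaps = missing_classes("qemu_x86_64,intel,tap", workers)
for name, missing in gaps.items():
    print(f"{name} is missing: {', '.join(missing) or 'nothing'}")
```

Under these assumptions every worker lacks at least one requested class, so no worker can take the job. Note that `tap_secondary` does not count as `tap` because class names are compared as whole strings.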


Updated by mkittler about 5 hours ago

Description updated (diff)


Updated by okurz about 4 hours ago

Status changed from New to In Progress
Assignee set to okurz


Updated by okurz about 4 hours ago

Description updated (diff)
Priority changed from Urgent to High


Updated by okurz about 3 hours ago

Related to action #73174: [osd][alert] Job age (scheduled) (median) alert added


Updated by okurz about 3 hours ago

Parent task set to #178243


#10

Updated by okurz about 2 hours ago

Description updated (diff)
Priority changed from High to Normal

Asked in Slack #discuss-qe-security https://suse.slack.com/archives/C044KDGKW58/p1741003651502569

(Oliver Kurz) Hi, I found https://openqa.suse.de/tests/16908416# which has been scheduled for 4 days and not picked up. Jobs with this worker class config can never be picked up as there are no workers for "qemu_x86_64,intel,tap". Unless somebody just added those tests recently, I wonder why that did not impact you more, as with such jobs never finishing, build validation would never finish either. Can you elaborate?
(Timo Jyrinki) They worked in October when they were added https://openqa.suse.de/tests/15723250, so it's likely a bit of wishful thinking that it went away due to infrastructure arrangements and would come back some day, and that it'd be a useful indicator that "this type of worker is still missing". The same happened e.g. with coppi (x86 IPMI); it eventually came back after we had a couple of months of similar never-executed jobs. I'm not sure why the next & previous tab doesn't correctly show the months of runs; it only shows whatever is being looked at right now + the latest (scheduled) one. We should still also have our own ticket to track that. Maybe it has not been created because scheduled / cancelled jobs do not prevent the "black badge" from appearing, only failing ones do, and the black badging has been our main goal of review.
(Oliver Kurz) That's a very good explanation. The point "it has not been created because scheduled / cancelled jobs do not prevent "black badge" from appearing" is crucial for us to consider. […] From the reviewer's perspective it makes sense to look at the black certificate, which means: there are no failures left for review. But from a product quality perspective that is for sure not enough, because there could still be reviewed but critical failures or a critical loss of test coverage. So am I right to assume that the main problem is then the mediocre process around build validation, as in "just tell me if you found critical bugs", not caring if tests never run?
(Timo Jyrinki) That is one way of putting it, yes. The focus is on "did you find critical bugs" combined with "report tomorrow afternoon now that we finally managed to do a build", not "were you able to fix all issues you may have in test infra or tests, to have full coverage, and if not how much more time you need?". Of course, generally when doing a report I consider if the coverage was "ok", but just on a high level compared to the previous builds on the summary page (taking into account possible refactorings affecting the numbers), and if we have considerably lower coverage I'll report that too.
(Oliver Kurz) OK. So I am leaning towards not alerting the tools team on too high job age, as we can't fix such cases of tests that are not picked up without you as test owners. I understand why there is little motivation on your side to do it, so we also shouldn't care about it, as such jobs will be cancelled anyway after 7 days when not picked up.

