action #178204

open

openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project (public) - coordination #178243: [epic] More efficient handling of big job schedules, not executable jobs, never matching worker classes, etc.

Reduce test start time on openqa.suse.de size:S

Added by gpuliti about 21 hours ago. Updated 8 minutes ago.

Status: In Progress
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date:
Due date:
% Done: 0%
Estimated time:
Description

Observation

https://monitor.qa.suse.de/d/7W06NBWGk/job-age?orgId=1&from=2025-03-03T02:45:29.209Z&to=2025-03-03T06:58:26.736Z&timezone=UTC

Relevant panel: https://monitor.qa.suse.de/d/7W06NBWGk/job-age?viewPanel=panel-5&orgId=1&from=2025-03-01T19%3A35%3A43.674Z&to=2025-03-04T06%3A19%3A31.656Z&timezone=utc

Based on observations there are recurring alerts indicating long wait times before execution.

gpuliti preferred not to silence the alert since it has not been that common yet, at least in the last week, but we should try to optimize test scheduling to reduce waiting times.

The main offenders seem to be jobs with a worker class config that can never be picked up, as there are no workers for "qemu_x86_64,intel,tap". These jobs were scheduled by "QE Security".

Acceptance Criteria

  • AC1: There is an understanding to remove/change the alert or have another workflow to handle the alert

Suggestions

  • Are there any bottlenecks? Answer: No, there aren't. We need to discuss expectations.
  • Also see similar stories from the past, e.g. #73174
  • Report new feature requests to detect jobs that cannot be picked up by any currently available worker class and block on that. After that we can cancel such jobs earlier and still keep a sensible alert for jobs that would match current workers but are just delayed for a long time.
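
The detection idea in the last suggestion could be sketched roughly as follows. This is only an illustration in Python, not openQA's scheduler code: it assumes that a job's worker class setting is a comma-separated list and that a single worker must offer every listed class for the job to be picked up. The function names and the sample workers and jobs are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple


def parse_worker_class(value: str) -> FrozenSet[str]:
    """Split a comma-separated worker class string into a set of class names."""
    return frozenset(part.strip() for part in value.split(",") if part.strip())


def is_matchable(job_worker_class: str, worker_classes: Iterable[str]) -> bool:
    """Return True if at least one worker offers every class the job requires."""
    required = parse_worker_class(job_worker_class)
    return any(required <= parse_worker_class(offered) for offered in worker_classes)


def split_scheduled_jobs(
    jobs: Dict[int, str], worker_classes: List[str]
) -> Tuple[List[int], List[int]]:
    """Partition job ids into (matchable, unmatchable) for the given workers."""
    matchable: List[int] = []
    unmatchable: List[int] = []
    for job_id, job_worker_class in sorted(jobs.items()):
        if is_matchable(job_worker_class, worker_classes):
            matchable.append(job_id)
        else:
            unmatchable.append(job_id)
    return matchable, unmatchable


# Hypothetical sample data: no single worker offers qemu_x86_64, intel and tap together.
workers = ["qemu_x86_64,tap", "qemu_x86_64,intel", "qemu_aarch64"]
jobs = {1: "qemu_x86_64,intel,tap", 2: "qemu_x86_64,tap", 3: "qemu_aarch64"}

matchable, unmatchable = split_scheduled_jobs(jobs, workers)
print("delayed but matchable:", matchable)
print("never matchable:", unmatchable)
```

With such a split, the job-age alert could be evaluated on the matchable jobs only, while the unmatchable ones are reported or cancelled separately.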

Rollback actions


Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - action #73174: [osd][alert] Job age (scheduled) (median) alert - Resolved - okurz - 2020-10-09
