action #135578

closed

openQA Project - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project - coordination #135122: [epic] OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert

Long job age and jobs not executed for long size:M

Added by okurz 8 months ago. Updated 7 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Motivation

Similar to #135122, we discovered mostly through user feedback rather than through alert handling that we have a long job age and jobs that are not executed for a long time. As people are waiting for their jobs to be executed for various products, we should ensure that short-term mitigations are applied to handle the situation while we fix the underlying problems in the background.

Acceptance criteria

Suggestions


Related issues 5 (1 open, 4 closed)

Related to openQA Infrastructure - action #134927: OSD throws 503, unresponsive for some minutes size:M (Resolved, okurz, 2023-08-31)

Related to openQA Infrastructure - action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry (Resolved, nicksinger, 2023-08-15)

Related to openQA Infrastructure - action #127523: [qe-core][s390x][kvm] Make use of generic "s390-kvm" class to prevent too long waiting for s390x worker ressources (Resolved, mgrifalconi)

Copied from openQA Infrastructure - action #135380: A significant number of scheduled jobs with one or two running triggers an alert (Resolved, okurz, 2023-09-07)

Copied to openQA Project - action #135644: Long job age and jobs not executed for long - malbec not working on jobs since 2023-09-13 - scheduler reserving slots for multi-machine clusters which never come (New, 2023-09-13)
