action #135578: Long job age and jobs not executed for long size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

action #135578

closed

openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project (public) - coordination #135122: [epic] OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert

Long job age and jobs not executed for long size:M

Added by okurz over 1 year ago. Updated over 1 year ago.

Status: Resolved

Priority: Urgent

Assignee: nicksinger

Category:

Target version: openQA Project (public) - Ready

Start date:

Due date:

% Done:

Estimated time:

Description

Motivation

As in #135122, we learned mostly from user feedback rather than from alert handling that job age is high and that jobs are not executed for a long time. As people are waiting for their jobs to be executed for various products, we should ensure that short-term mitigations are applied to handle the situation while we fix the underlying problems in the background.

Acceptance criteria

AC1: https://monitor.qa.suse.de/d/7W06NBWGk/job-age?orgId=1&tab=alert&from=1694146517954&to=1694507672085&viewPanel=2 is significantly below the alerting threshold

Suggestions

Look at the end of the list of scheduled jobs on https://openqa.suse.de/tests/ and identify why jobs are not picked up in a timely manner
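One way to approach this suggestion is to group the scheduled jobs by worker class and check which class holds the longest-waiting jobs. The following is only a minimal sketch under stated assumptions: the record fields `state`, `t_created` and `worker_class`, the helper name `oldest_scheduled_by_class` and the sample data are all hypothetical and are not taken from the openQA API or from this ticket.

```python
from collections import defaultdict
from datetime import datetime


def oldest_scheduled_by_class(jobs, now):
    """Summarize scheduled jobs per worker class.

    Returns a list of (worker_class, oldest_age_seconds, count) tuples,
    sorted so that the class with the longest-waiting job comes first.
    The field names used here are assumptions, not the openQA schema.
    """
    ages_by_class = defaultdict(list)
    for job in jobs:
        # Only jobs still waiting for a worker are of interest.
        if job.get("state") != "scheduled":
            continue
        created = datetime.fromisoformat(job["t_created"])
        age_seconds = (now - created).total_seconds()
        ages_by_class[job.get("worker_class", "unknown")].append(age_seconds)

    summary = [(cls, max(ages), len(ages)) for cls, ages in ages_by_class.items()]
    return sorted(summary, key=lambda row: row[1], reverse=True)


# Made-up sample records, for illustration only.
sample_jobs = [
    {"id": 1, "state": "scheduled", "t_created": "2023-09-10T12:00:00", "worker_class": "s390-kvm"},
    {"id": 2, "state": "scheduled", "t_created": "2023-09-12T11:00:00", "worker_class": "qemu_x86_64"},
    {"id": 3, "state": "scheduled", "t_created": "2023-09-12T06:00:00", "worker_class": "qemu_x86_64"},
    {"id": 4, "state": "running", "t_created": "2023-09-12T05:00:00", "worker_class": "qemu_x86_64"},
]

now = datetime(2023, 9, 12, 12, 0, 0)
report = oldest_scheduled_by_class(sample_jobs, now)
for worker_class, oldest_age, count in report:
    print(f"{worker_class}: {count} scheduled, oldest waiting {oldest_age / 3600:.1f} h")
```

A class that shows a high oldest age but few scheduled jobs may point to missing or broken workers for that class, whereas a high count across all classes would rather suggest a general scheduling problem. Either reading would still need to be confirmed against the actual worker and scheduler state.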

Related issues 5 (1 open — 4 closed)

Related to openQA Infrastructure (public) - action #134927: OSD throws 503, unresponsive for some minutes size:M

Resolved

okurz

2023-08-31

Related to openQA Infrastructure (public) - action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry

Resolved

nicksinger

2023-08-15

Related to openQA Infrastructure (public) - action #127523: [qe-core][s390x][kvm] Make use of generic "s390-kvm" class to prevent too long waiting for s390x worker ressources

Resolved

mgrifalconi

Copied from openQA Infrastructure (public) - action #135380: A significant number of scheduled jobs with one or two running triggers an alert

Resolved

okurz

2023-09-07

Copied to openQA Project (public) - action #135644: Long job age and jobs not executed for long - malbec not working on jobs since 2023-09-13 - scheduler reserving slots for multi-machine clusters which never come

New

2023-09-13
