action #97043

closed

coordination #96447: [epic] Failed systemd services and job age alerts

job queue hitting new record 14k jobs

Added by okurz over 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Start date: 2021-08-17
Due date:
% Done: 0%
Estimated time:

Description

Observation

The dashboard at https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1628098325601&to=1629026623275 showed a new record high of 14k jobs in the queue after a multi-day network outage. The outage was likely related to a network loop in one of the QA labs (connected to work by EngInfra).

Rollback steps

  • Unpause alert after queue is lower again

Related issues: 5 (2 open, 3 closed)

  • Related to openQA Infrastructure (public) - action #94237: No alert about too many scheduled tests size:S (Resolved, okurz, 2021-06-18, 2021-08-30)
  • Related to openQA Infrastructure (public) - action #96710: Error `Can't call method "write" on an undefined value` shows up in worker log leading to incompletes (Resolved, mkittler, 2021-08-10, 2021-08-31)
  • Copied to openQA Infrastructure (public) - action #97859: Improve network for OSD (New)
  • Copied to openQA Infrastructure (public) - action #97862: More openQA worker hardware for OSD size:M (Resolved, okurz, 2022-05-23)
  • Copied to openQA Tests (public) - action #97865: [qe-core][qe-yast][y] Optimize openQA maintenance tests by using less interactive GUI installer as base (New)
