coordination #96447

closed

[epic] Failed systemd services and job age alerts

Added by livdywan almost 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2021-08-04
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

Observation

  • Alerts for Disk I/O time for /dev/vdd (/results), handled in #96554
  • Alerts for Job age (scheduled)
  • Alerts for Failed systemd services
    • This is about the alert from 02.08.21 01:17 (the one from 05.08.21 07:26 was caused by a user's misconfiguration).

Suggestion

  • Bump our thresholds
  • Investigate whether our average load has increased significantly, e.g. due to new test groups being scheduled
  • Look at the systemd journal while the alert is firing (short of having #96551)
  • Check if we have data on reduced heat/power in server room 2
  • Job age (scheduled) (median) is likely due to issues with the WORKER_CLASS of https://openqa.suse.de/tests/6513484
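
As a rough illustration of the "look at the systemd journal while the alert is running" suggestion, the following Python sketch lists currently failed units and collects their recent journal entries. It is not taken from this ticket: the function names and the one-hour window are assumptions chosen for illustration, and it relies only on the standard `systemctl list-units --state=failed --plain --no-legend` and `journalctl --unit ... --since ... --no-pager` invocations.

```python
import subprocess


def parse_failed_units(listing: str) -> list[str]:
    """Extract unit names from `systemctl list-units --plain --no-legend` output.

    Each non-empty line starts with the unit name, followed by the
    LOAD, ACTIVE, SUB and DESCRIPTION columns.
    """
    units = []
    for line in listing.splitlines():
        fields = line.split()
        if fields:
            units.append(fields[0])
    return units


def snapshot_failed_units(since: str = "1 hour ago") -> dict[str, str]:
    """Return recent journal output for every failed unit (hypothetical helper)."""
    listing = subprocess.run(
        ["systemctl", "list-units", "--state=failed", "--plain", "--no-legend"],
        capture_output=True, text=True, check=False,
    ).stdout
    logs = {}
    for unit in parse_failed_units(listing):
        # The time window is an assumption; widen it if the alert fired earlier.
        logs[unit] = subprocess.run(
            ["journalctl", "--unit", unit, f"--since={since}", "--no-pager"],
            capture_output=True, text=True, check=False,
        ).stdout
    return logs
```

Running `snapshot_failed_units()` on the affected host while the alert is active would give a per-unit excerpt to attach to the ticket, until persistent journal records (#96551) are available.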

Subtasks 5 (0 open, 5 closed)

action #96551: Persistent records of systemd journal size:S | Resolved | okurz | 2021-10-22
action #96554: Mitigate on-going disk I/O alerts size:M | Resolved | mkittler | 2021-08-04
action #97043: job queue hitting new record 14k jobs | Resolved | okurz | 2021-08-17
openQA Project - action #105064: Reduce verbosity of openQA logging to improve performance and reduce storage requirements size:M | Resolved | okurz | 2021-10-22
action #105373: Ask to increase OSD /srv so that we can save enough logs+DB | Resolved | okurz | 2022-01-24

Related issues 4 (1 open, 3 closed)

Related to openQA Infrastructure - action #96710: Error `Can't call method "write" on an undefined value` shows up in worker log leading to incompletes | Resolved | mkittler | 2021-08-10 | 2021-08-31
Related to openQA Project - action #96557: jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry | Resolved | mkittler | 2021-08-04 | 2021-08-19
Related to openQA Infrastructure - action #96552: Persistent records of I/O usage by process size:M | Workable | 2021-08-04
Related to openQA Project - action #99246: Published QCOW images appear to be uncompressed | Resolved | okurz | 2021-09-24 | 2021-10-09
