coordination #96447

closed

[epic] Failed systemd services and job age alerts

Added by livdywan over 2 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
2021-08-04
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)

Description

Observation

  • Alerts for Disk I/O time for /dev/vdd (/results), handled in #96554
  • Alerts for Job age (scheduled)
  • Alerts for Failed systemd services
    • This is about the alert from 02.08.21 01:17 (the one from 05.08.21 07:26 was caused by a user's misconfiguration).

Suggestion

  • Bump our thresholds
  • Investigate whether our average load has increased immensely, e.g. due to new test groups being scheduled
  • Look at systemd journal while the alert is running (short of having #96551)
  • Check if we have data on reduced heat/power in server room 2
  • Job age (scheduled) (median) is likely due to issues with the WORKER_CLASS of https://openqa.suse.de/tests/6513484
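
To make the "Bump our thresholds" idea concrete for the Job age (scheduled) (median) alert, here is a minimal sketch of how a median-based age check reacts to a threshold. It is not the actual monitoring setup; the function names, the sample queue, and the threshold values are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta
from statistics import median


def median_scheduled_age(scheduled_at, now):
    """Return the median age of scheduled jobs, or None if the queue is empty."""
    if not scheduled_at:
        return None
    ages = [(now - t).total_seconds() for t in scheduled_at]
    return timedelta(seconds=median(ages))


def should_alert(scheduled_at, now, threshold):
    """True when the median age of scheduled jobs exceeds the threshold."""
    age = median_scheduled_age(scheduled_at, now)
    return age is not None and age > threshold


# Hypothetical data: three jobs scheduled 1, 2 and 30 hours before "now".
now = datetime(2021, 8, 5, 12, 0)
queue = [now - timedelta(hours=h) for h in (1, 2, 30)]

print(median_scheduled_age(queue, now))              # 2:00:00
print(should_alert(queue, now, timedelta(hours=4)))  # False
print(should_alert(queue, now, timedelta(hours=1)))  # True
```

With a median, one very old job in the sample does not trigger the check on its own; the alert only fires once at least half of the scheduled jobs are older than the threshold.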

Subtasks 5 (0 open, 5 closed)

  • action #96551: Persistent records of systemd journal size:S (Resolved, okurz, 2021-10-22)
  • action #96554: Mitigate on-going disk I/O alerts size:M (Resolved, mkittler, 2021-08-04)
  • action #97043: job queue hitting new record 14k jobs (Resolved, okurz, 2021-08-17)
  • openQA Project - action #105064: Reduce verbosity of openQA logging to improve performance and reduce storage requirements size:M (Resolved, okurz, 2021-10-22)
  • action #105373: Ask to increase OSD /srv so that we can save enough logs+DB (Resolved, okurz, 2022-01-24)

Related issues 4 (1 open, 3 closed)

  • Related to openQA Infrastructure - action #96710: Error `Can't call method "write" on an undefined value` shows up in worker log leading to incompletes (Resolved, mkittler, 2021-08-10, 2021-08-31)
  • Related to openQA Project - action #96557: jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry (Resolved, mkittler, 2021-08-04, 2021-08-19)
  • Related to openQA Infrastructure - action #96552: Persistent records of I/O usage by process size:M (Workable, 2021-08-04)
  • Related to openQA Project - action #99246: Published QCOW images appear to be uncompressed (Resolved, okurz, 2021-09-24, 2021-10-09)

Actions #1

Updated by livdywan over 2 years ago

Disk I/O time for /dev/vde (/space-slow) with vde: read 20068.460 was alerting again for a minute

Actions #2

Updated by livdywan over 2 years ago

  • Description updated (diff)
Actions #3

Updated by livdywan over 2 years ago

Disk I/O time for /dev/vdd (/results) once more alerting with vdd: write 13648.552 for 9 minutes.

Actions #4

Updated by okurz over 2 years ago

  • Priority changed from Normal to Urgent

bumping to "urgent" to counter alarm fatigue.

Actions #5

Updated by livdywan over 2 years ago

  • Subject changed from Disk I/O and job age schedule alerts to [epic] Disk I/O and job age schedule alerts
  • Description updated (diff)
Actions #6

Updated by livdywan over 2 years ago

  • Status changed from New to Workable
Actions #7

Updated by mkittler over 2 years ago

  • Subject changed from [epic] Disk I/O and job age schedule alerts to [epic] Failed systemd services and job age alerts
  • Description updated (diff)
Actions #8

Updated by livdywan over 2 years ago

  • Related to action #96710: Error `Can't call method "write" on an undefined value` shows up in worker log leading to incompletes added
Actions #9

Updated by livdywan over 2 years ago

  • Related to action #96557: jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry added
Actions #11

Updated by livdywan over 2 years ago

  • Related to action #96552: Persistent records of I/O usage by process size:M added
Actions #12

Updated by mkittler over 2 years ago

Not sure how workable this ticket is. Only one of the points mentioned under Observation is left, and in retrospect it is hard to tell what was wrong at the time; we have also handled later alerts of the same kind. I get that this is an epic, but the ticket title and description don't relate to the remaining subtasks in a meaningful way.

Actions #13

Updated by okurz over 2 years ago

  • Tracker changed from action to coordination
Actions #14

Updated by okurz over 2 years ago

  • Related to action #99246: Published QCOW images appear to be uncompressed added
Actions #15

Updated by okurz over 2 years ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz

Looks like this could also be related to #99246. I think we have all the mentioned ideas covered in follow-up tasks or resolved already. Tracking the remaining subtask.

Actions #16

Updated by okurz about 2 years ago

  • Status changed from Blocked to Resolved

All subtasks resolved, all tasks and goals covered.
