action #96554

closed

coordination #96447: [epic] Failed systemd services and job age alerts

Mitigate on-going disk I/O alerts size:M

Added by livdywan almost 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version:
Start date: 2021-08-04
Due date:
% Done: 0%
Estimated time:
Description

Observation

  • Alerts: Disk I/O time for /dev/vdd (/results)

Suggestion

  • Bump our thresholds
  • Monitor the systemd journal (while the alert is running)
  • Watch htop activity
  • Observe team/squad channels
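While the alert is firing, it can help to cross-check the disk's busy time directly on the host alongside the journal and htop. The following is a minimal sketch, not the monitoring stack's own method: it assumes the alert reflects the kernel's "milliseconds spent doing I/Os" counter from /proc/diskstats, which the ticket does not confirm. The function names and the 5-second default interval are hypothetical choices for illustration.

```python
import time


def io_ticks_ms(diskstats_text, device):
    """Return the 'milliseconds spent doing I/Os' counter for device, or None."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Layout: major, minor, device name, then the per-device counters.
        # The 10th counter (list index 12) is the time spent doing I/O in ms.
        if len(fields) >= 13 and fields[2] == device:
            return int(fields[12])
    return None


def busy_percent(before_ms, after_ms, interval_s):
    """Share of the sampling interval during which the device was busy."""
    return 100.0 * (after_ms - before_ms) / (interval_s * 1000.0)


def sample_busy_percent(device="vdd", interval_s=5.0, path="/proc/diskstats"):
    """Take two readings interval_s apart and return the busy percentage.

    Assumption: runs on the Linux host that owns the device (here the
    /dev/vdd disk named in the alert).
    """
    with open(path) as f:
        before = io_ticks_ms(f.read(), device)
    time.sleep(interval_s)
    with open(path) as f:
        after = io_ticks_ms(f.read(), device)
    if before is None or after is None:
        raise ValueError(f"device {device!r} not found in {path}")
    return busy_percent(before, after, interval_s)
```

Calling `sample_busy_percent("vdd")` repeatedly during an alert gives a rough local reading to compare against the alert threshold before deciding whether bumping it is justified.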

Related issues 4 (2 open, 2 closed)

Related to openQA Project - action #96557: jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry (Resolved, mkittler, 2021-08-04, 2021-08-19)

Related to openQA Infrastructure - action #96807: Web UI is slow and Apache Response Time alert got triggered (Resolved, okurz, 2021-08-12, 2021-10-01)

Copied to openQA Infrastructure - action #97409: Re-use existing filesystems on workers after reboot if possible to prevent full worker asset cache re-syncing (New)

Copied to openQA Infrastructure - action #97412: Reduce I/O load on OSD by using more cache size on workers with using free disk space when available instead of hardcoded space (New)
