coordination #96447
closed
[epic] Failed systemd services and job age alerts
Added by livdywan almost 3 years ago.
Updated over 2 years ago.
Estimated time:
(Total: 0.00 h)
Description
Observation

- Alerts for Disk I/O time for /dev/vdd (/results): handled in #96554
- Alerts for Job age (scheduled)
- Alerts for Failed systemd services
  - This is about the alert from 02.08.21 01:17 (the one from 05.08.21 07:26 was caused by a user's misconfiguration).
Suggestion
- Bump our thresholds
- Investigate whether our average load has increased significantly, e.g. due to new test groups being scheduled
- Look at the systemd journal while the alert is firing (short of having #96551)
- Check if we have data on reduced heat/power in server room 2
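The triage steps above can be sketched as a few shell commands. This is a minimal, hedged sketch: it assumes a Linux host running systemd, and it interprets the ticket's "02.08.21 01:17" timestamp as 2021-08-02 (DD.MM.YY); the exact journal window and filters are illustrative, not from the ticket.

```shell
# 1. List units currently in the "failed" state while the alert fires
#    (empty output means no failed units, or no systemd on this host)
failed_units=$(systemctl --failed --no-legend 2>/dev/null || true)
echo "Failed units: ${failed_units:-none found (or no systemd)}"

# 2. Inspect the journal around the alert time
#    (window assumes the ticket's 02.08.21 means 2021-08-02)
journalctl --since "2021-08-02 01:00" --until "2021-08-02 01:30" \
  -p err --no-pager 2>/dev/null || true

# 3. Spot-check whether average load has increased,
#    e.g. from newly scheduled test groups
load=$(cut -d ' ' -f1 /proc/loadavg 2>/dev/null || echo "n/a")
echo "Current 1-minute load average: $load"
```

On a host without systemd the `systemctl`/`journalctl` calls simply produce no output; the load check works on any Linux system via `/proc/loadavg`.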
Job age (scheduled) (median) is likely due to issues with the WORKER_CLASS of https://openqa.suse.de/tests/6513484
Disk I/O time for /dev/vde (/space-slow) was alerting again for a minute with vde: read 20068.460
- Description updated (diff)
Disk I/O time for /dev/vdd (/results) once more alerting with vdd: write 13648.552 for 9 minutes.
- Priority changed from Normal to Urgent
Bumping to "Urgent" to counter alarm fatigue.
- Subject changed from Disk I/O and job age schedule alerts to [epic] Disk I/O and job age schedule alerts
- Description updated (diff)
- Status changed from New to Workable
- Subject changed from [epic] Disk I/O and job age schedule alerts to [epic] Failed systemd services and job age alerts
- Description updated (diff)
- Related to action #96710: Error `Can't call method "write" on an undefined value` shows up in worker log leading to incompletes added
- Related to action #96557: jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry added
- Related to action #96552: Persistent records of I/O usage by process size:M added
Not sure how workable this ticket is. Only one of the points mentioned under Observation remains, and retrospectively it is hard to tell what was wrong at the time; we have handled later alerts of the same kind. I understand that this is an epic, but the ticket title and description don't relate to the remaining subtasks in a meaningful way.
- Tracker changed from action to coordination
- Related to action #99246: Published QCOW images appear to be uncompressed added
- Status changed from Workable to Blocked
- Assignee set to okurz
Looks like this could also be related to #99246. I think we have all the mentioned ideas covered in follow-up tasks or resolved already. Tracking the remaining subtask.
- Status changed from Blocked to Resolved
All subtasks resolved, all tasks and goals covered.