coordination #96447
closed
[epic] Failed systemd services and job age alerts
Added by livdywan almost 3 years ago.
Updated over 2 years ago.
Estimated time:
(Total: 0.00 h)
Description
Observation

- Alerts for Disk I/O time for /dev/vdd (/results): handled in #96554
- Alerts for Job age (scheduled)
- Alerts for Failed systemd services
  - This is about the alert from 02.08.21 01:17 (the one from 05.08.21 07:26 was caused by a user's misconfiguration).
Suggestion
- Bump our thresholds
- Investigate whether our average load has increased significantly, e.g. due to new test groups being scheduled
- Look at the systemd journal while the alert is firing (short of having #96551)
- Check if we have data on reduced heat/power in server room 2
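The triage steps above can be sketched as a few shell commands. This is a minimal, hedged sketch: it assumes a Linux host running systemd, and it interprets the ticket's "02.08.21 01:17" timestamp as 2021-08-02 (DD.MM.YY); the exact journal window and filters are illustrative, not from the ticket.

```shell
# 1. List units currently in the "failed" state while the alert fires
#    (empty output means no failed units, or no systemd on this host)
failed_units=$(systemctl --failed --no-legend 2>/dev/null || true)
echo "Failed units: ${failed_units:-none found (or no systemd)}"

# 2. Inspect the journal around the alert time
#    (window assumes the ticket's 02.08.21 means 2021-08-02)
journalctl --since "2021-08-02 01:00" --until "2021-08-02 01:30" \
  -p err --no-pager 2>/dev/null || true

# 3. Spot-check whether average load has increased,
#    e.g. from newly scheduled test groups
load=$(cut -d ' ' -f1 /proc/loadavg 2>/dev/null || echo "n/a")
echo "Current 1-minute load average: $load"
```

On a host without systemd the `systemctl`/`journalctl` calls simply produce no output; the load check works on any Linux system via `/proc/loadavg`.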
Job age (scheduled) (median) is likely due to issues with the WORKER_CLASS of https://openqa.suse.de/tests/6513484
Disk I/O time for /dev/vde (/space-slow) was alerting again for a minute with vde: read 20068.460
- Description updated (diff)
Disk I/O time for /dev/vdd (/results) once more alerting with vdd: write 13648.552 for 9 minutes.
- Priority changed from Normal to Urgent
Bumping to "Urgent" to counter alarm fatigue.
- Subject changed from Disk I/O and job age schedule alerts to [epic] Disk I/O and job age schedule alerts
- Description updated (diff)
- Status changed from New to Workable
- Subject changed from [epic] Disk I/O and job age schedule alerts to [epic] Failed systemd services and job age alerts
- Description updated (diff)
- Related to action #96710: Error `Can't call method "write" on an undefined value` shows up in worker log leading to incompletes added
- Related to action #96557: jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry added
- Related to action #96552: Persistent records of I/O usage by process size:M added
Not sure how workable this ticket is. Only one of the points mentioned under Observation remains, and retrospectively it is hard to tell what was wrong at the time; we have handled later alerts of the same kind. I understand that this is an epic, but the ticket title and description don't relate to the remaining subtasks in a meaningful way.
- Tracker changed from action to coordination
- Related to action #99246: Published QCOW images appear to be uncompressed added
- Status changed from Workable to Blocked
- Assignee set to okurz
Looks like this could also be related to #99246. I think we have all the mentioned ideas covered in follow-up tasks or resolved already. Tracking the remaining subtask.
- Status changed from Blocked to Resolved
All subtasks resolved, all tasks and goals covered.