action #125207
[closed] worker11 host up alert - similar as for worker13
Start date: 2023-03-01
Due date:
% Done: 0%
Estimated time:
Description
Observation
Several alert emails about worker11 being (un)available.
Acceptance criteria
- AC1: No alerts about worker11
Suggestions
- Follow what has been done for #125210
- Confirm what happened to the machine
- Ensure worker11 is stable
- Investigate whether there were any recent changes, e.g. disk space running out, too many jobs, manually installed packages or other things
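The disk-space check from the suggestions could be sketched roughly as follows, assuming Python 3 is available on the worker. The function names and the 90% threshold are illustrative assumptions and not taken from this ticket. On BTRFS the numbers reported this way can be misleading, so `btrfs filesystem usage` is likely the better source for a final verdict.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the used space of the filesystem containing `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_nearly_full(path="/", threshold=90.0):
    """Report whether usage is at or above `threshold` percent.

    The default threshold is a hypothetical value; pick one that fits the host.
    """
    return disk_usage_percent(path) >= threshold


if __name__ == "__main__":
    print(f"/ is {disk_usage_percent('/'):.1f}% used")
```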
Updated by livdywan over 1 year ago
- Copied to action #125210: worker13 host up alert - kernel crash size:M added
Updated by okurz over 1 year ago
- Tags set to infra, alert, reactive work, osd, worker11, nbg
Updated by mkittler over 1 year ago
- Related to action #125213: Failed systemd services alert due to crash dumps on worker11 and worker13 (except openqa.suse.de) added
Updated by mkittler over 1 year ago
Not sure whether #125210 is related or whether both workers just crashed coincidentally within a short time frame (see #125210#note-5 for the initial investigation).
Updated by mkittler over 1 year ago
- File dmesg-worker11.log added
The vanished crash dump was actually moved to /home/osukup/2023-03-01-01:35. So the crash reporting and the systemd service we use to trigger an alert both worked as intended.
We apparently got hardware errors while BTRFS was re-balancing, and the crash happened soon after. I have attached the dmesg logs for details.
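For comparing a future crash against the attached log, a rough Python sketch like the one below could pull out the hardware-error and BTRFS lines. The search patterns and function names are assumptions for illustration and are not derived from the actual content of dmesg-worker11.log, so they may need adjusting.

```python
import re

# Assumed patterns; adjust them to what the log actually contains.
HARDWARE_PATTERN = re.compile(r"hardware error|machine check|mce:", re.IGNORECASE)
BTRFS_PATTERN = re.compile(r"btrfs", re.IGNORECASE)


def classify_dmesg(lines):
    """Collect (line number, text) pairs for hardware-error and BTRFS lines."""
    hits = {"hardware": [], "btrfs": []}
    for number, line in enumerate(lines, start=1):
        if HARDWARE_PATTERN.search(line):
            hits["hardware"].append((number, line.rstrip()))
        if BTRFS_PATTERN.search(line):
            hits["btrfs"].append((number, line.rstrip()))
    return hits


def classify_dmesg_file(path):
    """Run classify_dmesg over a saved dmesg log file."""
    with open(path, errors="replace") as handle:
        return classify_dmesg(handle)
```

Calling `classify_dmesg_file("dmesg-worker11.log")` on a local copy of the attachment returns both lists with line numbers, which makes it easier to see whether the hardware errors fall inside the re-balancing window.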
Updated by okurz over 1 year ago
- Subject changed from worker11 host up alert to worker11 host up alert - similar as for worker13
- Description updated (diff)
- Status changed from New to Resolved
Host is stable right now and well-covered in our monitoring and alerting. If any similar case happens again we have the dmesg log output attached here for comparison.