worker11 host up alert - similar as for worker13
Several alert emails about worker11 being (un)available.
- AC1: No alerts about worker11
- Follow what has been done for #125210
- Confirm what happened to the machine
- Ensure worker11 is stable
- Investigate whether there were any recent changes, e.g. disk space running out, too many jobs, packages installed manually, or other things
#1 Updated by cdywan 3 months ago
- Copied to action #125210: worker13 host up alert - kernel crash size:M added
#4 Updated by mkittler 3 months ago
- Related to action #125213: Failed systemd services alert due to crash dumps on worker11 and worker13 (except openqa.suse.de) added
#5 Updated by mkittler 3 months ago
I am not sure whether #125210 is related or whether both workers just happened to crash within a short time frame (see #125210#note-5 for the initial investigation).
#7 Updated by mkittler 3 months ago
- File dmesg-worker11.log added
The vanished crash dump was actually moved to /home/osukup/2023-03-01-01:35. So the crash reporting, and the systemd service we use to trigger an alert, both worked as intended.
Apparently we got hardware errors while BTRFS was re-balancing, and the crash happened soon afterwards. I have attached the dmesg logs for details.
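For comparing a future crash against the attached log, a small filter like the following could help pull out the relevant lines. This is only a minimal sketch: the regular expression is an assumption about what hardware and BTRFS errors typically look like in dmesg output, not a list taken from dmesg-worker11.log, and the function names are made up for illustration.

```python
import re

# Assumed patterns for hardware-related and BTRFS error messages.
# Adjust them to whatever the log at hand actually contains.
SUSPICIOUS = re.compile(
    r"hardware error|machine check|mce:|btrfs.*(error|warning)|i/o error",
    re.IGNORECASE,
)


def find_suspicious_lines(lines):
    """Return (line_number, text) pairs for lines matching SUSPICIOUS."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if SUSPICIOUS.search(line)
    ]


def scan_file(path):
    """Scan a saved dmesg log and return the matching lines."""
    # errors="replace" avoids failing on stray non-UTF-8 bytes in the log.
    with open(path, errors="replace") as handle:
        return find_suspicious_lines(handle)
```

Calling `scan_file("dmesg-worker11.log")` on a local copy of the attachment returns the matching lines with their line numbers, so they can be compared with the output from another host.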
#8 Updated by okurz 3 months ago
- Subject changed from worker11 host up alert to worker11 host up alert - similar as for worker13
- Description updated (diff)
- Status changed from New to Resolved
Host is stable right now and well-covered in our monitoring and alerting. If any similar case happens again we have the dmesg log output attached here for comparison.