action #125207
closed
worker11 host up alert - similar as for worker13
Added by livdywan almost 2 years ago.
Updated almost 2 years ago.
Description
Observation
Several alert emails about worker11 being (un)available.
Acceptance criteria
- AC1: No alerts about worker11
Suggestions
- Follow what has been done for #125210
- Confirm what happened to the machine
- Ensure worker11 is stable
- Investigate whether there were any recent changes, e.g. disk space running out, too many jobs, manually installed packages or other things
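For the "disk space running out" check suggested above, a minimal Python sketch could look like the following. The function names and the 90% threshold are hypothetical and not taken from our monitoring setup; only the standard-library call `shutil.disk_usage` is relied on.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the used share (in percent) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Pseudo filesystems can report a total of 0; treat them as empty.
        return 0.0
    return 100.0 * usage.used / usage.total


def is_low_on_space(path="/", threshold_percent=90.0):
    """True if usage is at or above the (hypothetical) threshold."""
    return disk_usage_percent(path) >= threshold_percent


if __name__ == "__main__":
    print(f"/ is {disk_usage_percent('/'):.1f}% full")
```

This only covers one item from the list; job load and manually installed packages would need separate checks.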
Files
- Copied to action #125210: worker13 host up alert - kernel crash size:M added
- Tags set to infra, alert, reactive work, osd, worker11, nbg
- Related to action #125213: Failed systemd services alert due to crash dumps on worker11 and worker13 (except openqa.suse.de) added
Not sure whether #125210 is related or whether both workers just crashed coincidentally in a close time-frame (see #125210#note-5 for the initial investigation).
The vanished crash dump was actually moved to /home/osukup/2023-03-01-01:35. So the crash reporting and the systemd service we use to trigger an alert both worked as intended.
We apparently got hardware errors while BTRFS was re-balancing, and the crash happened soon after. I have attached the dmesg logs for details.
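If a similar crash needs to be compared against the attached logs, a rough Python sketch for pulling the relevant lines out of a saved dmesg dump might help. The function name and the keyword list are assumptions for illustration; they are not derived from the actual log content attached here.

```python
def find_suspicious_lines(log_text, keywords=("hardware error", "mce", "btrfs")):
    """Return the lines of `log_text` that contain any keyword (case-insensitive).

    The default keywords are a hypothetical starting point, not a
    list confirmed against the attached dmesg output.
    """
    lowered = tuple(k.lower() for k in keywords)
    return [
        line
        for line in log_text.splitlines()
        if any(k in line.lower() for k in lowered)
    ]
```

The function takes the log as a string, so it can be fed the contents of a saved dmesg file; plain substring matching is deliberately simple and may produce false positives.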
- Subject changed from worker11 host up alert to worker11 host up alert - similar as for worker13
- Description updated (diff)
- Status changed from New to Resolved
Host is stable right now and well-covered in our monitoring and alerting. If any similar case happens again we have the dmesg log output attached here for comparison.