action #125210
worker13 host up alert - kernel crash size:M
0%
Description
Observation
Several alert emails about worker13 being (un)available.
Acceptance criteria
- AC1: No alerts about worker13
Suggestions
- Confirm what happened to the machine
- Ensure worker13 is stable, e.g. follow https://monitor.qa.suse.de/d/WDworker13/worker-dashboard-worker13?orgId=1&viewPanel=65105&from=1677856992164&to=1678104302256
- Investigate whether there were any recent changes, e.g. disk space running out, too many jobs, manually installed packages or other things
- If the problem does not reproduce, just delete the crash dumps and resolve; otherwise ask for hardware replacement
Related issues
History
#1
Updated by cdywan 3 months ago
- Copied from action #125207: worker11 host up alert - similar as for worker13 added
#2
Updated by osukup 3 months ago
Kernel crash.
- smartctl reports:
=== START OF INFORMATION SECTION ===
Model Number: SAMSUNG MZVLB1T0HALR-00000
Serial Number: S3W6NX0MA02169
Firmware Version: EXA7301Q
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 1 024 209 543 168 [1,02 TB]
Unallocated NVM Capacity: 0
Controller ID: 4
NVMe Version: 1.2
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1 024 209 543 168 [1,02 TB]
Namespace 1 Utilization: 201 312 768 000 [201 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 8a91b0fb74
Local Time is: Wed Mar 1 11:56:48 2023 CET
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x03): S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 81 Celsius
Critical Comp. Temp. Threshold: 82 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.02W       -        -    0  0  0  0        0       0
 1 +     6.30W       -        -    1  1  1  1        0       0
 2 +     3.50W       -        -    2  2  2  2        0       0
 3 -   0.0760W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x04
Temperature: 46 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 255%
Data Units Read: 574 863 101 [294 TB]
Data Units Written: 6 334 094 942 [3,24 PB]
Host Read Commands: 2 067 369 623
Host Write Commands: 12 245 676 869
Controller Busy Time: 1 624 178 045
Power Cycles: 10
Power On Hours: 9 368
Unsafe Shutdowns: 8
Media and Data Integrity Errors: 0
Error Information Log Entries: 36
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 46 Celsius
Temperature Sensor 2: 51 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0         36     0  0x201c  0x4004      -            0     0     -
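For reference, a minimal sketch of how the two most telling values in the report above can be read. It assumes the bit layout of the NVMe "Critical Warning" byte and the data unit size (1000 blocks of 512 bytes) as defined in the NVMe specification; the helper names `decode_critical_warning` and `data_units_to_bytes` are made up for illustration and are not part of smartctl.

```python
# Hypothetical helpers for interpreting the smartctl NVMe report above.
# Bit meanings follow the NVMe SMART/Health log definition (assumption:
# only the first five bits are relevant for this NVMe 1.2 device).
CRITICAL_WARNING_BITS = {
    0x01: "available spare below threshold",
    0x02: "temperature outside threshold",
    0x04: "NVM subsystem reliability degraded",
    0x08: "media placed in read-only mode",
    0x10: "volatile memory backup device failed",
}


def decode_critical_warning(value: int) -> list[str]:
    """Return the description of every warning bit set in `value`."""
    return [text for bit, text in CRITICAL_WARNING_BITS.items() if value & bit]


def data_units_to_bytes(units: int) -> int:
    """Convert NVMe data units (1000 blocks of 512 bytes each) to bytes."""
    return units * 1000 * 512


if __name__ == "__main__":
    # Values copied from the report: Critical Warning 0x04, Data Units Written.
    print(decode_critical_warning(0x04))
    print(f"{data_units_to_bytes(6_334_094_942) / 1e15:.2f} PB written")
```

The decoded warning matches the "NVM subsystem reliability has been degraded" message, and the converted write volume matches the 3,24 PB that smartctl prints next to the raw counter.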
#4
Updated by mkittler 3 months ago
- Related to action #125213: Failed systemd services alert due do crash dumps on worker11 and worker13 (except openqa.suse.de) added
#5
Updated by mkittler 3 months ago
Also see my comments on #125213 about those two workers. Unfortunately the crash dump wasn't there anymore when I had a look. So I'm not only wondering why it crashed but also:
- Why was the crash dump removed?
- Is it related to the crash of #125207 which showed the exact same symptom (except the critical warning from smart)?
#7
Updated by osukup 3 months ago
mkittler wrote:
Also see my comments on #125213 about those two workers. Unfortunately the crash dump wasn't there anymore when I had a look. So I'm not only wondering why it crashed but also:
- Why was the crash dump removed?
- Is it related to the crash of #125207 which showed the exact same symptom (except the critical warning from smart)?
I moved the crash dump to my home directory to be able to start the service, but we still have it.
Both crashes happened during btrfs maintenance. And yes, I don't think the NVMe problems are related to the crash.
#8
Updated by mkittler 3 months ago
- File dmesg-worker13.log added
The vanished crash dump was actually moved to /home/osukup/2023-03-01-00:35/dmesg.txt. So everything worked as intended regarding the crash reporting and the systemd service we use to trigger an alert.
As on worker11, BTRFS was re-balancing but no hardware errors were reported. I have attached the dmesg logs for details.