action #125210
closed
worker13 host up alert - kernel crash size:M
Added by livdywan over 1 year ago.
Updated over 1 year ago.
Description
Observation
Several alert emails about worker13 being (un)available.
Acceptance criteria
- AC1: No alerts about worker13
Suggestions
Files
- Copied from action #125207: worker11 host up alert - similar as for worker13 added
Kernel crash ..
=== START OF INFORMATION SECTION ===
Model Number: SAMSUNG MZVLB1T0HALR-00000
Serial Number: S3W6NX0MA02169
Firmware Version: EXA7301Q
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 1 024 209 543 168 [1,02 TB]
Unallocated NVM Capacity: 0
Controller ID: 4
NVMe Version: 1.2
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1 024 209 543 168 [1,02 TB]
Namespace 1 Utilization: 201 312 768 000 [201 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 8a91b0fb74
Local Time is: Wed Mar 1 11:56:48 2023 CET
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x03): S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 81 Celsius
Critical Comp. Temp. Threshold: 82 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 7.02W - - 0 0 0 0 0 0
1 + 6.30W - - 1 1 1 1 0 0
2 + 3.50W - - 2 2 2 2 0 0
3 - 0.0760W - - 3 3 3 3 210 1200
4 - 0.0050W - - 4 4 4 4 2000 8000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x04
Temperature: 46 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 255%
Data Units Read: 574 863 101 [294 TB]
Data Units Written: 6 334 094 942 [3,24 PB]
Host Read Commands: 2 067 369 623
Host Write Commands: 12 245 676 869
Controller Busy Time: 1 624 178 045
Power Cycles: 10
Power On Hours: 9 368
Unsafe Shutdowns: 8
Media and Data Integrity Errors: 0
Error Information Log Entries: 36
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 46 Celsius
Temperature Sensor 2: 51 Celsius
Error Information (NVMe Log 0x01, 16 of 64 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 36 0 0x201c 0x4004 - 0 0 -
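As a reading aid for the SMART output above, here is a minimal sketch that decodes the Critical Warning byte and converts the data unit counters to bytes. It assumes the bit layout of the NVMe SMART/Health log (Log 0x02) and the NVMe convention that one data unit is 1000 blocks of 512 bytes. The function and constant names are made up for illustration and are not part of smartctl.

```python
# Bit meanings of the "Critical Warning" byte in the NVMe SMART/Health log.
# Only the bits defined for NVMe 1.2 (the version this drive reports) are listed.
CRITICAL_WARNING_BITS = {
    0: "available spare below threshold",
    1: "temperature outside threshold",
    2: "NVM subsystem reliability degraded",
    3: "media placed in read-only mode",
    4: "volatile memory backup device failed",
}


def decode_critical_warning(value):
    """Return the descriptions of all bits set in the Critical Warning byte."""
    return [text for bit, text in CRITICAL_WARNING_BITS.items() if value & (1 << bit)]


def data_units_to_bytes(units):
    """Convert NVMe data units (1000 blocks of 512 bytes each) to bytes."""
    return units * 1000 * 512


if __name__ == "__main__":
    # Values copied from the SMART output in this ticket.
    print(decode_critical_warning(0x04))
    print(data_units_to_bytes(6_334_094_942) / 1e15, "PB written")
    print(data_units_to_bytes(574_863_101) / 1e12, "TB read")
```

With the values from this ticket, 0x04 maps to the single "reliability degraded" bit, which matches the message printed under the FAILED result, and the written counter works out to roughly 3.24 PB. The 255% shown for Percentage Used is the largest value that field can report.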
- Related to action #125213: Failed systemd services alert due do crash dumps on worker11 and worker13 (except openqa.suse.de) added
Also see my comments on #125213 about those two workers. Unfortunately the crash dump wasn't there anymore when I had a look. So I'm not only wondering why it crashed but also:
- Why was the crash dump removed?
- Is it related to the crash of #125207 which showed the exact same symptom (except the critical warning from smart)?
Note that the problematic SSD is nvme0n1, which is not the drive where the root filesystem is installed. So I wouldn't expect a kernel panic just from that (only /var/lib/openqa is on that SSD).
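One way to double-check which device backs a given path is to look up the longest matching mount point in /proc/mounts. The sketch below is only an illustration: the helper name is hypothetical, and the sample mount table is made up (in particular the root device and the filesystem types are assumptions, not taken from worker13).

```python
# Made-up sample in /proc/mounts format; only the nvme0n1 -> /var/lib/openqa
# pairing comes from this ticket, everything else is a placeholder.
SAMPLE_MOUNTS = """\
/dev/sda2 / btrfs rw 0 0
/dev/nvme0n1 /var/lib/openqa ext2 rw 0 0
"""


def backing_device(path, mounts_text):
    """Return (device, mount_point) for the longest mount point containing path."""
    best = None
    wanted = path.rstrip("/") + "/"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        device, mount_point = fields[0], fields[1]
        # Compare with a trailing slash so /var/lib/openqa does not match /var/lib/openqax.
        prefix = mount_point.rstrip("/") + "/"
        if wanted.startswith(prefix):
            if best is None or len(mount_point) > len(best[1]):
                best = (device, mount_point)
    return best


if __name__ == "__main__":
    print(backing_device("/var/lib/openqa/pool", SAMPLE_MOUNTS))
    print(backing_device("/etc", SAMPLE_MOUNTS))
```

On a real host the second argument would be the contents of /proc/mounts, for example `open("/proc/mounts").read()`.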
mkittler wrote:
Also see my comments on #125213 about those two workers. Unfortunately the crash dump wasn't there anymore when I had a look. So I'm not only wondering why it crashed but also:
- Why was the crash dump removed?
- Is it related to the crash of #125207 which showed the exact same symptom (except the critical warning from smart)?
I moved the crash dump to my home directory so that the service could be started again, but we still have it.
Both crashes happened during btrfs maintenance, and yes, I don't think the NVMe problems are related to the crash.
The vanished crash dump was actually moved to /home/osukup/2023-03-01-00:35/dmesg.txt. So the crash reporting and the systemd service we use to trigger an alert both worked as intended.
As on worker11, BTRFS was re-balancing, but no hardware errors were reported. I have attached the dmesg logs for details.
- Subject changed from worker13 host up alert to worker13 host up alert - kernel crash size:M
- Description updated (diff)
- Status changed from New to Resolved
- Assignee set to mkittler