action #125210
closed
worker13 host up alert - kernel crash size:M
0%
Description
Observation
Several alert emails about worker13 being (un)available.
Acceptance criteria
- AC1: No alerts about worker13
Suggestions
- Confirm what happened to the machine
- Ensure worker13 is stable, e.g. follow https://monitor.qa.suse.de/d/WDworker13/worker-dashboard-worker13?orgId=1&viewPanel=65105&from=1677856992164&to=1678104302256
- Investigate whether there were any recent changes, e.g. disk space running out, too many jobs, packages installed manually, or other things
- If the problem does not reproduce, just delete the crash dumps and resolve; otherwise ask for a hardware replacement
Files
Updated by livdywan over 1 year ago
- Copied from action #125207: worker11 host up alert - similar as for worker13 added
Updated by osukup over 1 year ago
Kernel crash ..
- smartctl reports:
=== START OF INFORMATION SECTION ===
Model Number: SAMSUNG MZVLB1T0HALR-00000
Serial Number: S3W6NX0MA02169
Firmware Version: EXA7301Q
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 1 024 209 543 168 [1,02 TB]
Unallocated NVM Capacity: 0
Controller ID: 4
NVMe Version: 1.2
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1 024 209 543 168 [1,02 TB]
Namespace 1 Utilization: 201 312 768 000 [201 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 8a91b0fb74
Local Time is: Wed Mar 1 11:56:48 2023 CET
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x03): S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 81 Celsius
Critical Comp. Temp. Threshold: 82 Celsius
Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.02W       -        -    0  0  0  0        0       0
 1 +     6.30W       -        -    1  1  1  1        0       0
 2 +     3.50W       -        -    2  2  2  2        0       0
 3 -   0.0760W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x04
Temperature: 46 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 255%
Data Units Read: 574 863 101 [294 TB]
Data Units Written: 6 334 094 942 [3,24 PB]
Host Read Commands: 2 067 369 623
Host Write Commands: 12 245 676 869
Controller Busy Time: 1 624 178 045
Power Cycles: 10
Power On Hours: 9 368
Unsafe Shutdowns: 8
Media and Data Integrity Errors: 0
Error Information Log Entries: 36
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 46 Celsius
Temperature Sensor 2: 51 Celsius
Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0         36     0  0x201c  0x4004      -            0     0     -
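For readers who want to cross-check the SMART figures above, here is a minimal Python sketch. It assumes the bit layout of the NVMe "Critical Warning" field and the data unit size (1000 blocks of 512 bytes) as defined for the NVMe SMART/Health log page; the function names are made up for illustration and are not part of smartctl. Note also that "Percentage Used" is a vendor estimate that can exceed 100 and saturates at 255, so 255% means "at or beyond the end of the scale".

```python
# Meaning of the individual bits of the NVMe "Critical Warning" byte
# (assumed from the NVMe SMART / Health Information log page definition).
CRITICAL_WARNING_BITS = {
    0: "available spare below threshold",
    1: "temperature above or below threshold",
    2: "NVM subsystem reliability degraded",
    3: "media placed in read-only mode",
    4: "volatile memory backup device failed",
}


def decode_critical_warning(value):
    """Return the descriptions of all warning bits set in `value`."""
    return [text for bit, text in CRITICAL_WARNING_BITS.items() if value & (1 << bit)]


def data_units_to_bytes(units):
    """Convert NVMe data units to bytes (one unit = 1000 * 512 bytes)."""
    return units * 1000 * 512


if __name__ == "__main__":
    # Values taken from the smartctl output in this ticket.
    print(decode_critical_warning(0x04))
    print(round(data_units_to_bytes(6_334_094_942) / 1e15, 2), "PB written")
    print(round(data_units_to_bytes(574_863_101) / 1e12), "TB read")
```

With the values from this ticket, 0x04 decodes to the reliability-degraded bit only, which matches the text smartctl printed, and the data unit counters reproduce the 3,24 PB written and 294 TB read shown in brackets.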
Updated by mkittler over 1 year ago
- Related to action #125213: Failed systemd services alert due do crash dumps on worker11 and worker13 (except openqa.suse.de) added
Updated by mkittler over 1 year ago
Also see my comments on #125213 about those two workers. Unfortunately the crash dump wasn't there anymore when I had a look. So I'm not only wondering why it crashed but also:
- Why was the crash dump removed?
- Is it related to the crash of #125207 which showed the exact same symptom (except the critical warning from smart)?
Updated by mkittler over 1 year ago
Note that the problematic SSD is nvme0n1, which is not the drive where the root filesystem is installed. So I wouldn't expect a kernel panic just from that (only /var/lib/openqa is on that SSD).
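One way to double-check which device backs a given path is to look for the longest matching mount point in /proc/mounts. The following is only a sketch: the helper name and the sample mount table (including /dev/sda2 and the filesystem types) are hypothetical and not taken from worker13.

```python
def device_for_path(path, mounts_text):
    """Return the device of the longest mount point that contains `path`.

    `mounts_text` is expected in /proc/mounts format:
    "<device> <mount point> <fstype> <options> <dump> <pass>" per line.
    """
    best_device, best_mount = "", ""
    wanted = path.rstrip("/") + "/"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        device, mount_point = fields[0], fields[1]
        prefix = mount_point.rstrip("/") + "/"
        # ">=" lets a later mount on the same mount point win, as it shadows earlier ones
        if wanted.startswith(prefix) and len(mount_point) >= len(best_mount):
            best_device, best_mount = device, mount_point
    return best_device


# Hypothetical mount table for illustration only.
SAMPLE_MOUNTS = """\
/dev/sda2 / btrfs rw 0 0
/dev/nvme0n1 /var/lib/openqa ext4 rw 0 0
"""

if __name__ == "__main__":
    print(device_for_path("/var/lib/openqa/cache", SAMPLE_MOUNTS))
    print(device_for_path("/etc", SAMPLE_MOUNTS))
```

On a real host the second argument could be the contents of /proc/mounts, e.g. `open("/proc/mounts").read()`.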
Updated by osukup over 1 year ago
mkittler wrote:
Also see my comments on #125213 about those two workers. Unfortunately the crash dump wasn't there anymore when I had a look. So I'm not only wondering why it crashed but also:
- Why was the crash dump removed?
- Is it related to the crash of #125207 which showed the exact same symptom (except the critical warning from smart)?
I moved the crash dump to my home directory to be able to start the service, so we still have it.
Both crashes happened during btrfs maintenance, and yes, I don't think the NVMe problems are related to the crash.
Updated by mkittler over 1 year ago
- File dmesg-worker13.log added
The vanished crash dump was actually moved to /home/osukup/2023-03-01-00:35/dmesg.txt. So everything went fine regarding the crash reporting and the systemd service we use to trigger an alert.
As on worker11, BTRFS was rebalancing, but no hardware errors were reported. I have attached the dmesg logs for details.
Updated by okurz over 1 year ago
- Subject changed from worker13 host up alert to worker13 host up alert - kernel crash size:M
- Description updated (diff)
- Status changed from New to Resolved
- Assignee set to mkittler