action #125210

closed

worker13 host up alert - kernel crash size:M

Added by livdywan almost 2 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Start date:
2023-03-01
Due date:
% Done: 0%

Estimated time:
Tags:

Description

Observation

Several alert emails about worker13 being (un)available.

Acceptance criteria

  • AC1: No alerts about worker13

Suggestions


Files

dmesg-worker13.log (106 KB), mkittler, 2023-03-06 09:59

Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure (public) - action #125213: Failed systemd services alert due do crash dumps on worker11 and worker13 (except openqa.suse.de) (Closed, mkittler, 2023-03-01, 2023-03-16)

Copied from openQA Infrastructure (public) - action #125207: worker11 host up alert - similar as for worker13 (Resolved, mkittler, 2023-03-01)

Actions #1

Updated by livdywan almost 2 years ago

  • Copied from action #125207: worker11 host up alert - similar as for worker13 added
Actions #2

Updated by osukup almost 2 years ago

Kernel crash.

  • smartctl reports:
=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZVLB1T0HALR-00000
Serial Number:                      S3W6NX0MA02169
Firmware Version:                   EXA7301Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 1 024 209 543 168 [1,02 TB]
Unallocated NVM Capacity:           0
Controller ID:                      4
NVMe Version:                       1.2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1 024 209 543 168 [1,02 TB]
Namespace 1 Utilization:            201 312 768 000 [201 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 8a91b0fb74
Local Time is:                      Wed Mar  1 11:56:48 2023 CET
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     81 Celsius
Critical Comp. Temp. Threshold:     82 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.02W       -        -    0  0  0  0        0       0
 1 +     6.30W       -        -    1  1  1  1        0       0
 2 +     3.50W       -        -    2  2  2  2        0       0
 3 -   0.0760W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x04
Temperature:                        46 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    255%
Data Units Read:                    574 863 101 [294 TB]
Data Units Written:                 6 334 094 942 [3,24 PB]
Host Read Commands:                 2 067 369 623
Host Write Commands:                12 245 676 869
Controller Busy Time:               1 624 178 045
Power Cycles:                       10
Power On Hours:                     9 368
Unsafe Shutdowns:                   8
Media and Data Integrity Errors:    0
Error Information Log Entries:      36
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               46 Celsius
Temperature Sensor 2:               51 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0         36     0  0x201c  0x4004      -            0     0     -
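
For readers skimming the smartctl output above, here is a minimal sketch of how the two most telling values can be interpreted. It assumes the bit layout of the "Critical Warning" byte and the data unit size (1000 blocks of 512 bytes) as defined for the NVMe SMART / Health Information log page. The function names are made up for illustration and are not part of smartctl.

```python
# Bit meanings of the NVMe SMART "Critical Warning" byte
# (assumed from the NVMe SMART / Health Information log page definition).
CRITICAL_WARNING_BITS = {
    0: "available spare below threshold",
    1: "temperature above or below a threshold",
    2: "NVM subsystem reliability degraded",
    3: "media placed in read-only mode",
    4: "volatile memory backup device failed",
}


def decode_critical_warning(value):
    """Return the descriptions of all bits set in the critical warning byte."""
    return [
        text
        for bit, text in sorted(CRITICAL_WARNING_BITS.items())
        if value & (1 << bit)
    ]


def data_units_to_bytes(units):
    """Convert NVMe 'Data Units' to bytes (one unit = 1000 blocks of 512 bytes)."""
    return units * 1000 * 512


# Values taken from the report above.
print(decode_critical_warning(0x04))
print(round(data_units_to_bytes(6_334_094_942) / 1e15, 2), "PB written")
print(round(data_units_to_bytes(574_863_101) / 1e12), "TB read")
```

With the values from this report, 0x04 maps to the same "NVM subsystem reliability has been degraded" message that smartctl prints, and the write volume works out to the 3,24 PB shown. "Percentage Used: 255%" is the maximum value that field can hold, so the drive is well past its rated endurance.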

Actions #3

Updated by okurz almost 2 years ago

  • Tags set to infra
Actions #4

Updated by mkittler almost 2 years ago

  • Related to action #125213: Failed systemd services alert due do crash dumps on worker11 and worker13 (except openqa.suse.de) added
Actions #5

Updated by mkittler almost 2 years ago

Also see my comments on #125213 about those two workers. Unfortunately the crash dump wasn't there anymore when I had a look. So I'm not only wondering why it crashed but also:

  • Why was the crash dump removed?
  • Is it related to the crash of #125207 which showed the exact same symptom (except the critical warning from smart)?
Actions #6

Updated by mkittler almost 2 years ago

Note that the problematic SSD is nvme0n1 which is not the drive where the root filesystem is installed. So I wouldn't expect a kernel panic just from that (only /var/lib/openqa is on that SSD).
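
A quick way to double-check which device backs a given path is to look for the longest matching mount point in /proc/mounts. The sketch below does that on a sample string. The sample mount table is hypothetical: only nvme0n1 and /var/lib/openqa come from this ticket, while the root device name and the filesystem types are placeholders. The helper name backing_device is also made up.

```python
# Hypothetical excerpt in /proc/mounts format: device, mount point, fstype, options, dump, pass.
# Only nvme0n1 and /var/lib/openqa are taken from the ticket; the rest is made up.
SAMPLE_MOUNTS = """\
/dev/sda2 / btrfs rw,relatime 0 0
/dev/nvme0n1 /var/lib/openqa ext4 rw,relatime 0 0
"""


def backing_device(mounts_text, path):
    """Return the device of the longest mount point that contains `path`."""
    best_device, best_len = None, -1
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip empty or malformed lines
        device, mount_point = fields[0], fields[1]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # ">=" lets a later mount on the same mount point win
            if len(mount_point) >= best_len:
                best_device, best_len = device, len(mount_point)
    return best_device


print(backing_device(SAMPLE_MOUNTS, "/"))
print(backing_device(SAMPLE_MOUNTS, "/var/lib/openqa/pool/1"))
```

On a real host the same function could be fed the contents of /proc/mounts instead of the sample string.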

Actions #7

Updated by osukup almost 2 years ago

mkittler wrote:

Also see my comments on #125213 about those two workers. Unfortunately the crash dump wasn't there anymore when I had a look. So I'm not only wondering why it crashed but also:

  • Why was the crash dump removed?
  • Is it related to the crash of #125207 which showed the exact same symptom (except the critical warning from smart)?

I moved the crash dump to my home directory so that the service could start again, but we still have it.

Both crashes happened during btrfs maintenance, and yes, I don't think the NVMe problems are related to the crash.

Actions #8

Updated by mkittler almost 2 years ago

The vanished crash dump was actually moved to /home/osukup/2023-03-01-00:35/dmesg.txt. So the crash reporting and the systemd service we use to trigger an alert both worked as intended.

As on worker11, BTRFS was re-balancing, but no hardware errors were reported. I have attached the dmesg logs for details.

Actions #9

Updated by okurz almost 2 years ago

  • Subject changed from worker13 host up alert to worker13 host up alert - kernel crash size:M
  • Description updated (diff)
  • Status changed from New to Resolved
  • Assignee set to mkittler