action #125210: worker13 host up alert - kernel crash size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

action #125210

closed

worker13 host up alert - kernel crash size:M

Added by livdywan about 2 years ago. Updated almost 2 years ago.

Status:

Resolved

Priority:

High

Assignee:

mkittler

Category:

Target version:

openQA Project (public) - Ready

Start date:

2023-03-01

Due date:

% Done:

Estimated time:

Tags:

infra

Description

Observation¶

Several alert emails about worker13 being (un)available.

Acceptance criteria¶

AC1: No alerts about worker13

Suggestions¶

Confirm what happened to the machine
Ensure worker13 is stable, e.g. follow https://monitor.qa.suse.de/d/WDworker13/worker-dashboard-worker13?orgId=1&viewPanel=65105&from=1677856992164&to=1678104302256
Investigate whether there were any recent changes, e.g. disk space running out, too many jobs, manually installed packages, or other things
If the problem does not reproduce then just delete crash dumps and resolve, otherwise ask for hardware replacement

Files

dmesg-worker13.log (106 KB)

mkittler, 2023-03-06 09:59

Related issues 2 (0 open — 2 closed)

Updated by livdywan about 2 years ago

Copied from action #125207: worker11 host up alert - similar as for worker13 added

Updated by osukup about 2 years ago

Kernel crash.

smartctl reports:

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZVLB1T0HALR-00000
Serial Number:                      S3W6NX0MA02169
Firmware Version:                   EXA7301Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 1 024 209 543 168 [1,02 TB]
Unallocated NVM Capacity:           0
Controller ID:                      4
NVMe Version:                       1.2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1 024 209 543 168 [1,02 TB]
Namespace 1 Utilization:            201 312 768 000 [201 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 8a91b0fb74
Local Time is:                      Wed Mar  1 11:56:48 2023 CET
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     81 Celsius
Critical Comp. Temp. Threshold:     82 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.02W       -        -    0  0  0  0        0       0
 1 +     6.30W       -        -    1  1  1  1        0       0
 2 +     3.50W       -        -    2  2  2  2        0       0
 3 -   0.0760W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x04
Temperature:                        46 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    255%
Data Units Read:                    574 863 101 [294 TB]
Data Units Written:                 6 334 094 942 [3,24 PB]
Host Read Commands:                 2 067 369 623
Host Write Commands:                12 245 676 869
Controller Busy Time:               1 624 178 045
Power Cycles:                       10
Power On Hours:                     9 368
Unsafe Shutdowns:                   8
Media and Data Integrity Errors:    0
Error Information Log Entries:      36
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               46 Celsius
Temperature Sensor 2:               51 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0         36     0  0x201c  0x4004      -            0     0     -
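
The health fields above can be cross-checked by hand. The following is a minimal sketch, not part of the original report: it decodes the "Critical Warning" bitfield and converts the "Data Units" counters into bytes. It assumes the bit meanings and the data-unit size (1000 units of 512 bytes) from the NVMe SMART/Health log definition; the function and constant names are made up for illustration, and the input numbers are copied from the smartctl output.

```python
# Bit meanings of the NVMe "Critical Warning" byte (assumption: taken from
# the NVMe SMART / Health Information log definition, bits 0 to 4).
CRITICAL_WARNING_BITS = {
    0: "available spare below threshold",
    1: "temperature outside threshold",
    2: "NVM subsystem reliability degraded",
    3: "media placed in read-only mode",
    4: "volatile memory backup device failed",
}

# One reported data unit stands for 1000 units of 512 bytes.
DATA_UNIT_BYTES = 512 * 1000


def decode_critical_warning(value):
    """Return the descriptions of all bits set in the critical warning byte."""
    return [text for bit, text in CRITICAL_WARNING_BITS.items() if value & (1 << bit)]


def data_units_to_bytes(units):
    """Convert an NVMe 'Data Units Read/Written' counter into bytes."""
    return units * DATA_UNIT_BYTES


# Values copied from the smartctl output above.
warnings = decode_critical_warning(0x04)
written = data_units_to_bytes(6_334_094_942)
read = data_units_to_bytes(574_863_101)
capacity = 1_024_209_543_168  # Total NVM Capacity in bytes

print(warnings)
print(f"written: {written / 1e15:.2f} PB, read: {read / 1e12:.0f} TB")
print(f"approximate full-drive writes: {written / capacity:.0f}")
```

Bit 2 is the only bit set in 0x04, which matches the "NVM subsystem reliability has been degraded" line. The write volume corresponds to a few thousand full-drive writes on a 1 TB device, which is consistent with the saturated "Percentage Used" value.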

Updated by okurz about 2 years ago

Tags set to infra

Updated by mkittler about 2 years ago

Related to action #125213: Failed systemd services alert due do crash dumps on worker11 and worker13 (except openqa.suse.de) added

Updated by mkittler about 2 years ago

Also see my comments on #125213 about those two workers. Unfortunately the crash dump wasn't there anymore when I had a look. So I'm not only wondering why it crashed but also:

Why was the crash dump removed?
Is it related to the crash of #125207 which showed the exact same symptom (except the critical warning from smart)?

Updated by mkittler about 2 years ago

Note that the problematic SSD is nvme0n1 which is not the drive where the root filesystem is installed. So I wouldn't expect a kernel panic just from that (only /var/lib/openqa is on that SSD).
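
One way to double-check which device backs a given path is to look it up in the mount table. The sketch below is only an illustration: `backing_device` is a hypothetical helper, and the sample table is invented (the root device name and the filesystem types are assumptions; only the nvme0n1 to /var/lib/openqa mapping comes from this ticket). On a real host the text would come from /proc/mounts.

```python
def backing_device(mounts_text, path):
    """Return the device whose mount point is the longest prefix of path.

    mounts_text is expected in /proc/mounts format:
    "<device> <mount point> <fstype> <options> <dump> <pass>" per line.
    """
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        device, mount_point = fields[0], fields[1]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the most specific (longest) matching mount point.
            if best is None or len(mount_point) > len(best[1]):
                best = (device, mount_point)
    return best[0] if best else None


# Hypothetical sample data, NOT taken from worker13.
SAMPLE_MOUNTS = """\
/dev/sda2 / btrfs rw 0 0
/dev/nvme0n1 /var/lib/openqa ext4 rw 0 0
"""

print(backing_device(SAMPLE_MOUNTS, "/"))
print(backing_device(SAMPLE_MOUNTS, "/var/lib/openqa/pool"))
```

With real data, `backing_device(open("/proc/mounts").read(), "/")` would show whether the root filesystem lives on the suspicious SSD or on another drive.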

Updated by osukup almost 2 years ago

mkittler wrote:

Also see my comments on #125213 about those two workers. Unfortunately the crash dump wasn't there anymore when I had a look. So I'm not only wondering why it crashed but also:

Why was the crash dump removed?

Is it related to the crash of #125207 which showed the exact same symptom (except the critical warning from smart)?

I moved the crash dump to my home directory to be able to start the service, so we still have it.

Both crashes happened during btrfs maintenance, and yes, I don't think the NVMe problems are related to the crash.

Updated by mkittler almost 2 years ago

File dmesg-worker13.log added

The vanished crash dump was actually moved to /home/osukup/2023-03-01-00:35/dmesg.txt. So everything went fine regarding the crash reporting and the systemd service we use to trigger an alert.

As on worker11, BTRFS was rebalancing, but no hardware errors were reported. I have attached the dmesg logs for details.

Updated by okurz almost 2 years ago

Subject changed from worker13 host up alert to worker13 host up alert - kernel crash size:M
Description updated (diff)
Status changed from New to Resolved
Assignee set to mkittler
