Project

General

Profile

Actions

action #49694

closed

openqaworker7 lost one NVMe

Added by nicksinger about 5 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Low
Assignee:
Category:
-
Target version:
Start date:
2019-03-26
Due date:
% Done:

0%

Estimated time:

Description

One of our workers lost one of its NVMe's. The device still shows up in the PCI bus:

81:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01) (prog-if 02 [NVM Express])
    Subsystem: Intel Corporation DC P3600 SSD [2.5" SFF]
    Physical Slot: 4
    Flags: bus master, fast devsel, latency 0, IRQ 31
    Memory at fbe10000 (64-bit, non-prefetchable) [size=16K]
    Expansion ROM at fbe00000 [disabled] [size=64K]
    Capabilities: [40] Power Management version 3
    Capabilities: [50] MSI-X: Enable+ Count=32 Masked-
    Capabilities: [60] Express Endpoint, MSI 00
    Capabilities: [100] Advanced Error Reporting
    Capabilities: [150] Virtual Channel
    Capabilities: [180] Power Budgeting <?>
    Capabilities: [190] Alternative Routing-ID Interpretation (ARI)
    Capabilities: [270] Device Serial Number 55-cd-2e-40-4c-73-1e-2d
    Capabilities: [2a0] #19
    Kernel driver in use: nvme
    Kernel modules: nvme

82:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01) (prog-if 02 [NVM Express])
    Subsystem: Intel Corporation DC P3600 SSD [2.5" SFF]
    Physical Slot: 5
    Flags: bus master, fast devsel, latency 0, IRQ 35
    Memory at fbd10000 (64-bit, non-prefetchable) [size=16K]
    Expansion ROM at fbd00000 [disabled] [size=64K]
    Capabilities: [40] Power Management version 3
    Capabilities: [50] MSI-X: Enable+ Count=32 Masked-

But in dmesg you can see:

[ 2590.917219] nvme nvme0: resetting controller
[ 2592.371347] nvme 0000:81:00.0: Could not set queue count (6)
[ 2592.371352] nvme nvme0: IO queues not created

I've installed the nvme-cli tools to check further details but it seems like the controller refuses to work:

openqaworker7:~ # nvme error-log /dev/nvme0
NVMe Status:INTERNAL(6)

This is how it should look like (tested on the other NVMe):

openqaworker7:~ # nvme error-log /dev/nvme1
Error Log Entries for device:nvme1 entries:64
.................
 Entry[ 0]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[ 1]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[ 2]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[ 3]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[ 4]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[ 5]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[ 6]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[ 7]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[ 8]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[ 9]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[10]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[11]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[12]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[13]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[14]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[15]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[16]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[17]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[18]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[19]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[20]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[21]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[22]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[23]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[24]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[25]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[26]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[27]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[28]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[29]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[30]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[31]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[32]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[33]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[34]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[35]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[36]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[37]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[38]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[39]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[40]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[41]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[42]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[43]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[44]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[45]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[46]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[47]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[48]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[49]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[50]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[51]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[52]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[53]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[54]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[55]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[56]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[57]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[58]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[59]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[60]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[61]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[62]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[63]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................

Resetting the disk/controller with nvme reset /dev/nvme0 just yields another dmesg entry as above.


Related issues 3 (0 open3 closed)

Related to openQA Infrastructure - action #19238: setup pool devices+mounts+folders with salt(was: ext2 on workers busted)Resolvedokurz2017-05-19

Actions
Related to openQA Infrastructure - action #64685: openqaworker1 showing NVMe problems "kernel: nvme nvme0: Abort status: 0x0"Resolvedokurz2020-03-202020-10-23

Actions
Related to openQA Infrastructure - action #77011: openqaworker7 (o3) is stuck in "recovery mode" as visible over IPMI SoLResolvedfavogt2020-11-05

Actions
Actions

Also available in: Atom PDF