action #49694

closed

openqaworker7 lost one NVMe

Added by nicksinger about 5 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Low
Assignee:
Category: -
Target version:
Start date: 2019-03-26
Due date:
% Done: 0%
Estimated time:

Description

One of our workers lost one of its NVMe devices. The device still shows up on the PCI bus:

81:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01) (prog-if 02 [NVM Express])
    Subsystem: Intel Corporation DC P3600 SSD [2.5" SFF]
    Physical Slot: 4
    Flags: bus master, fast devsel, latency 0, IRQ 31
    Memory at fbe10000 (64-bit, non-prefetchable) [size=16K]
    Expansion ROM at fbe00000 [disabled] [size=64K]
    Capabilities: [40] Power Management version 3
    Capabilities: [50] MSI-X: Enable+ Count=32 Masked-
    Capabilities: [60] Express Endpoint, MSI 00
    Capabilities: [100] Advanced Error Reporting
    Capabilities: [150] Virtual Channel
    Capabilities: [180] Power Budgeting <?>
    Capabilities: [190] Alternative Routing-ID Interpretation (ARI)
    Capabilities: [270] Device Serial Number 55-cd-2e-40-4c-73-1e-2d
    Capabilities: [2a0] #19
    Kernel driver in use: nvme
    Kernel modules: nvme

82:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01) (prog-if 02 [NVM Express])
    Subsystem: Intel Corporation DC P3600 SSD [2.5" SFF]
    Physical Slot: 5
    Flags: bus master, fast devsel, latency 0, IRQ 35
    Memory at fbd10000 (64-bit, non-prefetchable) [size=16K]
    Expansion ROM at fbd00000 [disabled] [size=64K]
    Capabilities: [40] Power Management version 3
    Capabilities: [50] MSI-X: Enable+ Count=32 Masked-

But in dmesg you can see:

[ 2590.917219] nvme nvme0: resetting controller
[ 2592.371347] nvme 0000:81:00.0: Could not set queue count (6)
[ 2592.371352] nvme nvme0: IO queues not created
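A minimal sketch for spotting these messages on other workers, assuming the kernel log wording matches the three lines above. The function name `nvme_failures` is made up for illustration; it only filters whatever text is piped into it.

```shell
# Hypothetical helper: keep only kernel log lines that match the
# NVMe failure messages quoted in this ticket.
nvme_failures() {
    grep -E 'nvme.*(resetting controller|Could not set queue count|IO queues not created)'
}

# Usage (reading the kernel log may need root):
#   dmesg | nvme_failures
```

Matching on the message text rather than on "nvme0" keeps the filter usable when the controller numbering differs between machines.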

I've installed the nvme-cli tools to check further details but it seems like the controller refuses to work:

openqaworker7:~ # nvme error-log /dev/nvme0
NVMe Status:INTERNAL(6)

This is what it should look like (tested on the other NVMe):

openqaworker7:~ # nvme error-log /dev/nvme1
Error Log Entries for device:nvme1 entries:64
.................
 Entry[ 0]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[ 1]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[ 2]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[ 3]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[ 4]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[ 5]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[ 6]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[ 7]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[ 8]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[ 9]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[10]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[11]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[12]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[13]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[14]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[15]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[16]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[17]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[18]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[19]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[20]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[21]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[22]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[23]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[24]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[25]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[26]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[27]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[28]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[29]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[30]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[31]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[32]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[33]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[34]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[35]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[36]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[37]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[38]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[39]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[40]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[41]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[42]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[43]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[44]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[45]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[46]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[47]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[48]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[49]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[50]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[51]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[52]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[53]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[54]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[55]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[56]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[57]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[58]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[59]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[60]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[61]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[62]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................
 Entry[63]   
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
.................

Resetting the disk/controller with nvme reset /dev/nvme0 just yields another dmesg entry as above.


Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure - action #19238: setup pool devices+mounts+folders with salt (was: ext2 on workers busted) | Resolved | okurz | 2017-05-19

Related to openQA Infrastructure - action #64685: openqaworker1 showing NVMe problems "kernel: nvme nvme0: Abort status: 0x0" | Resolved | okurz | 2020-03-20 | 2020-10-23

Related to openQA Infrastructure - action #77011: openqaworker7 (o3) is stuck in "recovery mode" as visible over IPMI SoL | Resolved | favogt | 2020-11-05

Actions #1

Updated by nicksinger about 5 years ago

  • Description updated (diff)
Actions #2

Updated by okurz almost 5 years ago

  • Project changed from openQA Project to openQA Infrastructure
Actions #3

Updated by acarvajal over 4 years ago

Could https://progress.opensuse.org/issues/54074 be a consequence of this?

Actions #4

Updated by okurz over 4 years ago

  • Related to action #19238: setup pool devices+mounts+folders with salt(was: ext2 on workers busted) added
Actions #5

Updated by nicksinger about 4 years ago

  • Status changed from In Progress to Feedback

I'm somewhat out of ideas about what to do with this. The machine is clearly out-of-service and IIRC we planned some budget to replace the whole machine. Therefore we might be able to just order a replacement?

Actions #6

Updated by nicksinger about 4 years ago

I've asked Ralf Unger for our current budget plans and if we can reorder an (identical) replacement.

Actions #7

Updated by okurz about 4 years ago

Currently the machine runs just fine with a single NVMe with the same setup we have on other machines that also have just one NVMe. openqaworker7 could be a candidate to move to o3 network.

EDIT: I asked for the machine to be moved to o3. See #62162 for details.

Actions #8

Updated by okurz about 4 years ago

Just to make sure we have a common understanding: openqaworker7 is up and running within the o3 infrastructure but is still missing the second NVMe, hence it is running with a reduced number of worker instances.

@nicksinger Did I understand correctly that you still plan to get a replacement for the missing NVMe? SUSE IT, in the person of gschlotter, already helped and resolved the similar situation for openqaworker4 and openqaworker13, so we might be better off asking them for this one as well.

Actions #9

Updated by nicksinger about 4 years ago

  • Status changed from Feedback to In Progress

Yes, together with the order of openqaworker1's NVMe. Also for the record: we currently have NVMe devices connected via SFF-8639, a connector that seems likely to die out. I checked both machines (openqaworker1, openqaworker7) and both should have a free PCIe x16 slot (checked with dmidecode -t slot). Therefore I will order cheaper M.2 NVMe devices and a PCIe-to-M.2 adapter card.
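The slot check can be sketched roughly as follows. This assumes the usual `dmidecode -t slot` layout, where each slot has `Designation:`, `Type:` and `Current Usage:` lines in that order; the function name `free_slots` is made up for illustration.

```shell
# Hypothetical helper: read `dmidecode -t slot` output on stdin and
# print the designation and type of every slot reported as "Available".
free_slots() {
    awk -F': ' '
        /Designation:/   { name = $2 }   # remember the slot name
        /Type:/          { type = $2 }   # remember the slot type
        /Current Usage:/ { if ($2 == "Available") print name " (" type ")" }
    '
}

# Usage (dmidecode needs root):
#   dmidecode -t slot | free_slots
```

Note that a slot reported as available by the firmware says nothing about whether a given card physically fits the chassis.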

Actions #10

Updated by nicksinger about 4 years ago

  • Status changed from In Progress to Feedback

I've created REQ_402391:

Hello everybody,

I kindly ask you to order these items:
2x https://geizhals.de/intel-ssd-660p-1tb-ssdpeknw010t801-ssdpeknw010t8xt-ssdpeknw010t8x1-a1859204.html
2x https://geizhals.de/asrock-ultra-quad-m-2-card-90-cxg630-00uanz-a1787970.html

The ASRock adapter card is more of a suggestion and we can order whatever our vendor has in stock.

Please also tell me when exactly the order will arrive (best with tracking from the shipping company) since due to the current covid-19 situation we need to make sure somebody in Nuremberg is available to receive the parcel.

Thanks in advance,
  Nick

(for transparency, since most people won't be able to see the request in RIO)

Actions #11

Updated by okurz about 4 years ago

Thanks for sharing this information. This is always better than just "there is a ticket which you can't see" :)

Actions #12

Updated by nicksinger about 4 years ago

RIO is weird but it seems something is happening:

    'No'    by Nick Singer  -25/03/2020 09:57:39 AM View Action Details
    'Assign'    by Vinothraja Karthikeyan   -25/03/2020 10:39:14 AM View Action Details
    'Business Approval not Needed'  by Vinothraja Karthikeyan   -25/03/2020 10:39:14 AM View Action Details
    'Financial Approval not Needed' by Vinothraja Karthikeyan   -25/03/2020 10:39:14 AM View Action Details
    'Purchasing not Needed' by Vinothraja Karthikeyan   -25/03/2020 10:39:14 AM View Action Details
    'Assign to Fulfillment Queue'   by Vinothraja Karthikeyan   -25/03/2020 10:39:15 AM View Action Details
    'Assign'    by Jared Disbrow    -26/03/2020 03:50:31 PM View Action Details
    'Assign to Fulfillment Queue'   by Jared Disbrow    -26/03/2020 03:50:36 PM View Action Details
    'Assign'    by Vinothraja Karthikeyan   -27/03/2020 01:26:10 AM View Action Details
    'Assign to Fulfillment Queue'   by Vinothraja Karthikeyan   -27/03/2020 01:26:11 AM View Action Details
Actions #13

Updated by livdywan almost 4 years ago

nicksinger wrote:

RIO is weird but it seems something is happening:

  'No'    by Nick Singer  -25/03/2020 09:57:39 AM View Action Details
  'Assign'    by Vinothraja Karthikeyan   -25/03/2020 10:39:14 AM View Action Details
  'Business Approval not Needed'  by Vinothraja Karthikeyan   -25/03/2020 10:39:14 AM View Action Details
  'Financial Approval not Needed' by Vinothraja Karthikeyan   -25/03/2020 10:39:14 AM View Action Details
  'Purchasing not Needed' by Vinothraja Karthikeyan   -25/03/2020 10:39:14 AM View Action Details
  'Assign to Fulfillment Queue'   by Vinothraja Karthikeyan   -25/03/2020 10:39:15 AM View Action Details
  'Assign'    by Jared Disbrow    -26/03/2020 03:50:31 PM View Action Details
  'Assign to Fulfillment Queue'   by Jared Disbrow    -26/03/2020 03:50:36 PM View Action Details
  'Assign'    by Vinothraja Karthikeyan   -27/03/2020 01:26:10 AM View Action Details
  'Assign to Fulfillment Queue'   by Vinothraja Karthikeyan   -27/03/2020 01:26:11 AM View Action Details

Any update in the meantime?

Actions #14

Updated by nicksinger almost 4 years ago

I got a notification from Ralf yesterday that the disk arrived. Since I have to go to the office this week anyway I'll try to drop it somewhere where Infra could pick it up and replace it.

Actions #15

Updated by okurz almost 4 years ago

  • Priority changed from Normal to Low

Effectively we have a good workaround: our salt changes use whatever NVMe devices are available, so the impact is actually not that high. Changing priority to "Low".

Actions #16

Updated by livdywan over 3 years ago

nicksinger wrote:

I got a notification from Ralf yesterday that the disk arrived. Since I have to go to the office this week anyway I'll try to drop it somewhere where Infra could pick it up and replace it.

@nicksinger Do you have an update on what happened to the disk?

Actions #17

Updated by okurz over 3 years ago

  • Target version set to Ready
Actions #18

Updated by okurz over 3 years ago

  • Due date set to 2020-09-16

runger has received hardware that could likely be our replacement hardware so I hope that nsinger will handle the hardware and have it included in the setups we have available.

Actions #19

Updated by nicksinger over 3 years ago

Update: we have two of these adapters and one M.2 SSD/NVMe. Gerhard has now taken one of these adapters, will plug in the M.2 and try to install it in either openqaworker1 or openqaworker7. Once we have feedback on whether the adapters fit the server chassis I will order a second M.2 (apparently the second one never arrived).

Actions #20

Updated by nicksinger over 3 years ago

Due to the server's position in the rack, we will now install the first card in openqaworker7. The corresponding infra ticket is: https://infra.nue.suse.com/SelfService/Display.html?id=176945

Actions #21

Updated by nicksinger over 3 years ago

  • Related to action #64685: openqaworker1 showing NVMe problems "kernel: nvme nvme0: Abort status: 0x0" added
Actions #22

Updated by nicksinger over 3 years ago

  • Status changed from Feedback to Blocked

The adapter card is roughly 1cm too long. I will now order 2 new ones and another SSD/NVMe.

Actions #23

Updated by nicksinger over 3 years ago

  • Status changed from Blocked to Feedback

Requested from mgriessmeier:

2x https://www.csv-direct.de/artinfo.php?artnr=A1902217&KATEGORIE=1902
1x https://www.csv-direct.de/artinfo.php?artnr=A0204212&KATEGORIE=0204
Actions #24

Updated by nicksinger over 3 years ago

  • Status changed from Feedback to Workable

The NVMe was built into the machine today. You can check which block device belongs to which disk by comparing the IDs from lspci with the paths reported by udev:

openqaworker7:~ # lspci | grep Non-Volatile
81:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
82:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
83:00.0 Non-Volatile memory controller: Intel Corporation SSD 660P Series (rev 03)        #<- this is the new one
openqaworker7:~ # udevadm info -q all -n /dev/nvme2
P: /devices/pci0000:80/0000:80:02.0/0000:83:00.0/nvme/nvme2
N: nvme2
E: DEVNAME=/dev/nvme2
E: DEVPATH=/devices/pci0000:80/0000:80:02.0/0000:83:00.0/nvme/nvme2                       #<- "83:00.0" from lspci shows that this NVMe is available at /dev/nvme2
E: MAJOR=245
E: MINOR=2
E: SUBSYSTEM=nvme
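The same mapping can be sketched for all controllers at once by resolving the sysfs links. This assumes the layout seen in the DEVPATH above (`.../<PCI address>/nvme/<controller>`); the function name `nvme_pci_map` and its optional directory argument are made up for illustration.

```shell
# Hypothetical helper: print each NVMe controller together with the
# PCI address it sits on, based on where its sysfs link points.
nvme_pci_map() {
    sysroot="${1:-/sys/class/nvme}"
    for ctrl in "$sysroot"/nvme*; do
        [ -e "$ctrl" ] || continue          # no controllers present
        # Resolves to e.g. /sys/devices/pci0000:80/0000:80:02.0/0000:83:00.0/nvme/nvme2
        path=$(readlink -f "$ctrl")
        # The PCI address is the directory two levels above the controller node.
        pci=$(basename "$(dirname "$(dirname "$path")")")
        printf '%s %s\n' "$(basename "$ctrl")" "$pci"
    done
}

nvme_pci_map
```

The printed address carries the PCI domain prefix (`0000:`), so it matches the output of `lspci -D` rather than plain `lspci`.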
Actions #25

Updated by okurz over 3 years ago

@nicksinger you set the ticket to "Workable" but kept yourself as assignee, so I assume you plan to follow up yourself. Any help needed?

Actions #26

Updated by nicksinger over 3 years ago

  • Assignee deleted (nicksinger)

okurz wrote:

@nicksinger you set the ticket to "Workable" but kept yourself as assignee, so I assume you plan to follow up yourself. Any help needed?

Indeed there is work left: the new disk needs to be put into use and the old one removed. But you're right, there is nothing here that only I can do.

Actions #27

Updated by okurz over 3 years ago

  • Due date changed from 2020-09-16 to 2020-10-23
  • Status changed from Workable to Blocked
  • Assignee set to okurz
  • Priority changed from Low to Normal
Actions #28

Updated by okurz over 3 years ago

  • Due date deleted (2020-10-23)
  • Priority changed from Normal to Low

The new NVMe is automatically used, see:

# lsblk 
NAME    MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda       8:0    0 931.5G  0 disk  
└─sda1    8:1    0 929.5G  0 part  /
sdb       8:16   0 931.5G  0 disk  
└─sdb1    8:17   0 931.5G  0 part  
nvme2n1 259:0    0 953.9G  0 disk  
└─md127   9:127  0   1.3T  0 raid0 /var/lib/openqa
nvme1n1 259:1    0 372.6G  0 disk  
└─md127   9:127  0   1.3T  0 raid0 /var/lib/openqa
…
# df -h
Filesystem          Size  Used Avail Use% Mounted on
…
/dev/md127          1.3T  102G  1.2T   9% /var/lib/openqa
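A small sketch for listing which disks back a RAID device without reading the tree by eye. It assumes `lsblk -rn -o NAME,TYPE` output (raw format, no header, each child listed after its parent disk); the function name `raid_members` is made up for illustration.

```shell
# Hypothetical helper: read "NAME TYPE" pairs on stdin and print
# "<disk> <md device>" for every RAID device found below a disk.
raid_members() {
    awk '
        $2 == "disk"  { disk = $1 }        # remember the most recent disk
        $2 ~ /^raid/  { print disk, $1 }   # a raid child belongs to that disk
    '
}

# Usage:
#   lsblk -rn -o NAME,TYPE | raid_members
```

For a layout like the one above this should list nvme2n1 and nvme1n1 as members of md127 and skip sda and sdb.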

I could not read out the serial number of the broken NVMe, as the device seems to be not accessible at all, but I listed in the infra ticket all the other devices that should not be removed :)

Actions #29

Updated by okurz over 3 years ago

  • Related to action #77011: openqaworker7 (o3) is stuck in "recovery mode" as visible over IPMI SoL added
Actions #30

Updated by okurz over 3 years ago

  • Status changed from Blocked to Resolved

In #49694#note-28 I stated that the new NVMe is automatically used while it was called "nvme2". In the meantime the old broken NVMe/SSD has been removed and the devices are now numbered "nvme0n1" and "nvme1n1". lsblk confirms that both devices are properly used, forming a RAID0 with a combined size of 1.3 TB.
