action #64685

closed

openqaworker1 showing NVMe problems "kernel: nvme nvme0: Abort status: 0x0"

Added by okurz about 4 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2020-03-20
Due date: 2020-10-23
% Done: 0%
Estimated time:

Description

Observation

[20/03/2020 11:39:23] <DimStar> Martchus_: any idea what's up here? https://openqa.opensuse.org/tests/overview?arch=&machine=&modules=&todo=1&distri=microos&distri=opensuse&version=Tumbleweed&build=20200318&groupid=1# ?
[20/03/2020 11:52:02] <guillaume_g> Defolos: fyi, https://bugzilla.opensuse.org/show_bug.cgi?id=1167232
[20/03/2020 11:52:05] <|Anna|> openSUSE bug 1167232 in openSUSE Tumbleweed "Vagrant Tumbleweed 20200317 fails due to unsupported configuration PS2" [Normal, New]
[20/03/2020 11:53:02] <DimStar> okurz: the nvme issues we had last time around was ow4, right?

On w1, in the system journal:

nvme nvme0: Abort status: 0x0

Related issues: 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #49694: openqaworker7 lost one NVMe (Resolved, okurz, 2019-03-26)

Actions
Actions #1

Updated by okurz about 4 years ago

According to journalctl | grep 'kernel: nvme nvme0: Abort status: 0x0' the first error had already appeared on

Mar 18 16:18:20 openqaworker1 kernel: nvme nvme0: Abort status: 0x0

I can try to recreate the filesystem, but I guess the device itself is broken. Thinking about it again, maybe ext2 is not a good choice after all?
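
As a first check before touching the filesystem one could look at the controller's own error counters. A minimal sketch, assuming the nvme-cli package is installed on the worker:

# inspect SMART data and the controller error log of the suspect device
nvme smart-log /dev/nvme0     # watch critical_warning, media_errors, num_err_log_entries
nvme error-log /dev/nvme0     # recent controller error-log entries, if any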

Actions #2

Updated by okurz about 4 years ago

Recreating the RAID and creating a filesystem on it reproduces the problem on nvme0:

Mar 20 16:57:37 openqaworker1 kernel: md127: detected capacity change from 799906201600 to 0
Mar 20 16:57:37 openqaworker1 kernel: md: md127 stopped.
Mar 20 16:57:48 openqaworker1 kernel:  nvme0n1: p1
Mar 20 16:57:48 openqaworker1 kernel:  nvme0n1: p1
Mar 20 16:57:48 openqaworker1 kernel: md127: detected capacity change from 0 to 799906201600
Mar 20 16:57:48 openqaworker1 kernel:  nvme1n1: p1
Mar 20 16:58:37 openqaworker1 kernel: nvme nvme0: I/O 0 QID 23 timeout, aborting
Mar 20 16:58:37 openqaworker1 kernel: nvme nvme0: I/O 1 QID 23 timeout, aborting
Mar 20 16:58:37 openqaworker1 kernel: nvme nvme0: I/O 2 QID 23 timeout, aborting
Mar 20 16:58:37 openqaworker1 kernel: nvme nvme0: I/O 3 QID 23 timeout, aborting
Mar 20 16:58:37 openqaworker1 kernel: nvme nvme0: Abort status: 0x0
Mar 20 16:58:42 openqaworker1 kernel: nvme nvme0: Abort status: 0x0
Mar 20 16:58:42 openqaworker1 kernel: nvme nvme0: Abort status: 0x0
Mar 20 16:58:42 openqaworker1 kernel: nvme nvme0: Abort status: 0x0

I had to reboot w1 as it was stuck on I/O. I could reproduce the NVMe problems when recreating the RAID0 used for /var/lib/openqa and trying to create a new filesystem on it. After the reboot I will try to create a RAID0 from just the single NVMe1 device and bring up a limited number of openQA worker instances which are also MM capable. As a second priority I will add multi-machine support to w4 and w7, see #64700 . I plan to do that only later, to not disrupt currently running jobs and to not end up with no workers running at all due to misconfiguration.
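
A rough sketch of that single-device setup, assuming the array keeps its name /dev/md/openqa and is mounted directly at /var/lib/openqa (not a verbatim record of the commands that were run):

# create a RAID0 from only the good NVMe, put a fresh filesystem on it and restore the layout
mdadm --create /dev/md/openqa --level=0 --force --raid-devices=1 --run /dev/nvme1n1
mkfs.ext2 /dev/md/openqa
mount /dev/md/openqa /var/lib/openqa
mkdir -p /var/lib/openqa/{pool,cache,share}
chown _openqa-worker:root /var/lib/openqa/{pool,cache,share}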

w1 back up with 12 worker instances, single NVMe.

Actions #3

Updated by okurz about 4 years ago

  • Status changed from In Progress to Blocked

Created https://infra.nue.suse.com/SelfService/Display.html?id=166270 asking for replacement of the broken device.

Actions #4

Updated by okurz about 4 years ago

  • Status changed from Blocked to Feedback
  • Assignee changed from okurz to nicksinger
  • Priority changed from Urgent to High

The infra ticket was resolved as nicksinger stated that he will order a replacement, so I am assigning this ticket to him.

Reducing from "Urgent" to "High" as the machine was brought back up by me with the single NVMe present.

@nicksinger ETA?

Actions #5

Updated by okurz about 4 years ago

pinged in RC

Actions #6

Updated by nicksinger about 4 years ago

"REQ_402391: Order request: NVMe replacement for openqaworker1 and openqaworker7" is in "FULFILLMENT QUEUE" since 25/03/2020 09:57:39 AM. I've added a "note" asking if and how I can speed things up.

Actions #7

Updated by okurz about 4 years ago

  • Due date set to 2020-06-01
  • Status changed from Feedback to Blocked

Fine. It's actually not that urgent as we can live with a single NVMe for some time; I just wanted to know what we can expect. So I am setting this to "Blocked" with a corresponding due date for you.

Actions #8

Updated by nicksinger almost 4 years ago

Since no progress is visible in the RIO ticket, I escalated to Ralf.

Actions #9

Updated by okurz almost 4 years ago

What's the result of the escalation after 2 months? You should also have received reminder emails about this ticket after the due date passed. I am coming back to this topic because fvogt found the issue "again" in the logs.

Actions #10

Updated by mgriessmeier almost 4 years ago

They got delivered in the meantime.

Actions #11

Updated by okurz over 3 years ago

  • Target version set to Ready
Actions #12

Updated by okurz over 3 years ago

  • Tags changed from caching, openQA, sporadic, arm, ipmi, worker to sporadic, worker
Actions #13

Updated by okurz over 3 years ago

@nsinger are you receiving reminders about this ticket, like twice a week? I guess you can also use the NVMe from runger here.

Actions #14

Updated by nicksinger over 3 years ago

Update: we have two of these adapters and one M.2 SSD/NVMe. Gerhard has now taken one of the adapters, will plug in the M.2 and try to install it in either openqaworker1 or openqaworker7. Once we have feedback on whether the adapters fit the server chassis I will order a second M.2 (apparently the second one never arrived).

Actions #15

Updated by nicksinger over 3 years ago

Actions #16

Updated by nicksinger over 3 years ago

  • Status changed from Blocked to Feedback

Waiting for the hardware mentioned in https://progress.opensuse.org/issues/49694#note-22

Actions #17

Updated by nicksinger over 3 years ago

  • Status changed from Feedback to Workable

The NVMe was built into the machine today. You can check which block device corresponds to the new disk by comparing the PCI IDs from lspci with the device paths reported by udev:

openqaworker1:~ # lspci | grep Non-Volatile
81:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
82:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
83:00.0 Non-Volatile memory controller: Intel Corporation SSD 660P Series (rev 03)        #<- this is the new one
openqaworker1:~ # udevadm info -q all -n /dev/nvme2
P: /devices/pci0000:80/0000:80:02.0/0000:83:00.0/nvme/nvme2
N: nvme2
E: DEVNAME=/dev/nvme2
E: DEVPATH=/devices/pci0000:80/0000:80:02.0/0000:83:00.0/nvme/nvme2                       #<- "83:00.0" from lspci shows that this NVMe is available at /dev/nvme2
E: MAJOR=245
E: MINOR=2
E: SUBSYSTEM=nvme
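
The same mapping can also be read straight from sysfs; a small sketch (standard sysfs layout assumed, no extra tooling needed):

# print the PCI address behind each NVMe controller
for c in /sys/class/nvme/nvme*; do
    printf '%s -> %s\n' "${c##*/}" "$(basename "$(readlink -f "$c"/device)")"
done
# on this machine nvme2 should resolve to 0000:83:00.0, i.e. the new SSD 660P
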
Actions #18

Updated by okurz over 3 years ago

OK, but now we still have the unused nvme0 in the system. Can you rip it out and discard it? Then we can rely on the semi-automatic provisioning of the device, same as for other workers. As an alternative we could manually add the second NVMe to the RAID0, forget that nvme0 is faulty, and eventually someone runs into the same problem with nvme0 again ;)

Actions #19

Updated by okurz over 3 years ago

@nicksinger can you please describe what is the current state for this ticket, what are your plans, suggestions, things to do or things to wait for?

Actions #20

Updated by nicksinger over 3 years ago

okurz wrote:

@nicksinger can you please describe what is the current state for this ticket, what are your plans, suggestions, things to do or things to wait for?
the SSD is built in now. What's left is making use of it and pulling out the old one. As we have two "old" ones in there, we need some coordination with Infra when they pull it, to make sure it's the right one

Actions #21

Updated by okurz over 3 years ago

nicksinger wrote:

the SSD is built in now. What's left is making use of it and pulling out the old one. As we have two "old" ones in there, we need some coordination with Infra when they pull it, to make sure it's the right one

Well, lsblk -o NAME,MODEL,SERIAL tells me:

NAME    MODEL                                    SERIAL
sda     HGST HTE721010A9                         
├─sda1                                           
├─sda2                                           
└─sda3                                           
  └─md1                                          
nvme2n1 INTEL SSDPEKNW010T8                      BTNH022600SN1P0B    
nvme1n1 INTEL SSDPE2ME400G4                      CVMD4352001W400FGN  
└─md127                                          
nvme0n1 INTEL SSDPE2ME400G4                      CVMD4352007S400FGN  

So can you ask your Infra contact person, e.g. gschlotter, to rip out the device with serial ID CVMD4352007S400FGN whenever the machine is not fully utilized? You could either reopen https://infra.nue.suse.com/SelfService/Display.html?id=176945 or create a new ticket.
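
To double-check which controller carries that serial before anyone pulls a tray, a quick cross-check (assuming nvme-cli is installed):

# show the serial number reported by each controller
for d in /dev/nvme0 /dev/nvme1 /dev/nvme2; do
    printf '%s: ' "$d"; nvme id-ctrl "$d" | grep -i '^sn'
done
# the controller reporting CVMD4352007S400FGN is the one to remove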

Actions #22

Updated by nicksinger over 3 years ago

  • Assignee deleted (nicksinger)
  • Priority changed from High to Low

I doubt Infra can tell which tray it is if you only give them a serial number. I know some servers can blink a light at a specific tray slot, but I don't know if that is supported here. In any case it should definitely be combined with the effort to remove the broken disk from worker7 (see https://progress.opensuse.org/issues/49694).

Actions #23

Updated by okurz over 3 years ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz
  • Priority changed from Low to High

Fine. But I would still like to have this handled with high prio. I will create the corresponding infra tickets then.

Actions #24

Updated by okurz over 3 years ago

  • Due date changed from 2020-06-01 to 2020-10-23
  • Status changed from In Progress to Blocked
  • Priority changed from High to Normal
Actions #25

Updated by okurz over 3 years ago

  • Status changed from Blocked to Feedback

The infra ticket was resolved and the machine was powered on, but it did not come up properly, as reported by fvogt. I could not easily recover it myself due to a problem with ipmitool, but the ipmitool version from "wotan" was ok. fvogt fixed it with https://paste.opensuse.org/view/raw/34762541 , i.e.:

Issue:
error: symbol `grub_file_filters' not found.
grub rescue> 
-> Mismatch of installed core.img and modules

Idea:
In the rescue shell, set the prefix to the location of modules from an older snapshot

Procedure:
set btrfs_relative_path=n
ls (mduuid/958978b17edb5f3df27f5be3c639f19b)/@/.snapshots/
-> oldest is 755/
set btrfs_subvol=/@/.snapshots/755/snapshot
set btrfs_relative_path=y
set prefix=/usr/share/grub2/
insmod normal
normal
configfile /boot/grub2/grub.cfg
-> System boots
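
To make that fix permanent after booting, the usual openSUSE approach is to reinstall the bootloader and regenerate its config (a sketch; the target disk depends on the actual boot setup and is assumed to be /dev/sda here):

# rewrite core.img so it matches the installed grub2 modules again
grub2-install /dev/sda
grub2-mkconfig -o /boot/grub2/grub.cfg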

The system is up. We need to update the RAID config, but the system is currently working hard on a new Tumbleweed snapshot; I can also do that at a different time :) What I did then:

# stop the existing single-device array and recreate the RAID0 over both NVMes
mdadm --stop /dev/md/openqa
mdadm --create /dev/md/openqa --level=0 --force --raid-devices=2 --run /dev/nvme?n1
# persist the new array definition and clean up the config by hand
mdadm --detail --scan >> /etc/mdadm.conf
vim /etc/mdadm.conf
# fresh filesystem plus the standard openQA directory layout and ownership
mkfs.ext2 /dev/md/openqa
mkdir -p /var/lib/openqa/{pool,cache,share} && /usr/bin/chown _openqa-worker:root /var/lib/openqa/cache /var/lib/openqa/pool /var/lib/openqa/share
# bring the cache services back up and enable 8 additional worker instances
systemctl start openqa-worker-cacheservice openqa-worker-cacheservice-minion openqa-worker.target
systemctl unmask openqa-worker@{13..20} && systemctl enable --now openqa-worker@{13..20}

This includes enabling more worker instances, as we now have 1.3TB of free pool space and openqaworker4 also runs that many worker instances.
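
The listing above does not show the mount step itself; assuming /var/lib/openqa is mounted straight from the array via fstab (as the later df output suggests), the missing piece looks roughly like:

# hypothetical fstab entry and mount for the freshly created array
echo '/dev/md/openqa /var/lib/openqa ext2 defaults 0 0' >> /etc/fstab
mount /var/lib/openqa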

Next I checked the CACHELIMIT settings, considering an increase here as well:

> for i in aarch64 openqaworker1 openqaworker4 openqaworker7 power8 imagetester rebel; do echo $i && ssh root@$i "grep -B 2 CACHELIMIT /etc/openqa/workers.ini" ; done
aarch64
# okurz: 2020-05-18: jobs incompleted with "No space left on device",
# reducing 200->160
CACHELIMIT = 160
openqaworker1
openqaworker4
WORKER_HOSTNAME=192.168.112.7
CACHEDIRECTORY = /var/lib/openqa/cache
CACHELIMIT = 50
openqaworker7
# cache had been fully used to 50G and free disk space was 200G so we
# can probably increase this limit
CACHELIMIT = 180
power8
CACHEDIRECTORY = /var/lib/openqa/cache
# okurz: 2020-05-09: Increased cache limit from default 50G as we have enough space
CACHELIMIT = 600
imagetester
WORKER_HOSTNAME=192.168.112.5
CACHEDIRECTORY = /var/lib/openqa/cache
CACHELIMIT = 50
rebel

I have increased the cache limit to 400GB, with a comment in /etc/openqa/workers.ini. Also informed users in irc://chat.freenode.net/opensuse-factory and will monitor.
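
For reference, a hypothetical excerpt of what the adjusted /etc/openqa/workers.ini on openqaworker1 could look like (the comment and exact layout are illustrative, not a copy of the file):

[global]
CACHEDIRECTORY = /var/lib/openqa/cache
# okurz: 2020-10: pool is 1.3TB now, so a 400G asset cache is fine
CACHELIMIT = 400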

Actions #26

Updated by okurz over 3 years ago

  • Status changed from Feedback to Resolved

There was one remaining problem: I had enabled more worker instances but they did not have a tap device assigned. fvogt has resolved that in another ticket by changing /etc/openqa/workers.ini . Currently I see /dev/md127 1.3T 526G 715G 43% /var/lib/openqa and load average: 9.74, 12.20, 12.90. I think everything is ok here.
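
For multi-machine jobs the new instances need the "tap" worker class; a minimal sketch of the kind of per-instance entries that were missing (instance numbers are illustrative):

# hypothetical per-instance sections in /etc/openqa/workers.ini
[13]
WORKER_CLASS = qemu_x86_64,tap
[14]
WORKER_CLASS = qemu_x86_64,tap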
