action #64685

openqaworker1 showing NVMe problems "kernel: nvme nvme0: Abort status: 0x0"

Added by okurz over 1 year ago. Updated 9 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date: 2020-03-20
Due date: 2020-10-23
% Done: 0%
Estimated time:

Description

Observation

[20/03/2020 11:39:23] <DimStar> Martchus_: any idea what's up here? https://openqa.opensuse.org/tests/overview?arch=&machine=&modules=&todo=1&distri=microos&distri=opensuse&version=Tumbleweed&build=20200318&groupid=1# ?
[20/03/2020 11:52:02] <guillaume_g> Defolos: fyi, https://bugzilla.opensuse.org/show_bug.cgi?id=1167232
[20/03/2020 11:52:05] <|Anna|> openSUSE bug 1167232 in openSUSE Tumbleweed "Vagrant Tumbleweed 20200317 fails due to unsupported configuration PS2" [Normal, New]
[20/03/2020 11:53:02] <DimStar> okurz: the nvme issues we had last time around was ow4, right?

On w1 in system journal:

nvme nvme0: Abort status: 0x0

Related issues

Related to openQA Infrastructure - action #49694: openqaworker7 lost one NVMe (Resolved, 2019-03-26)

History

#1 Updated by okurz over 1 year ago

According to journalctl | grep 'kernel: nvme nvme0: Abort status: 0x0' the first error already appeared on

Mar 18 16:18:20 openqaworker1 kernel: nvme nvme0: Abort status: 0x0

I can try to recreate the filesystem, but I suspect the device itself is broken. On second thought, maybe ext2 is not a good choice after all?
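To gauge how long the device has been failing, the journal can be scanned for the first occurrence and a per-day count of the error. A minimal sketch parsing journal-style lines (here fed from a sample string quoting this ticket; on the worker the input would come from `journalctl -k` instead):

```shell
# Sample kernel-log lines as quoted in this ticket; on openqaworker1 the
# input would be `journalctl -k`, not this variable.
log='Mar 18 16:18:20 openqaworker1 kernel: nvme nvme0: Abort status: 0x0
Mar 20 16:58:37 openqaworker1 kernel: nvme nvme0: I/O 0 QID 23 timeout, aborting
Mar 20 16:58:42 openqaworker1 kernel: nvme nvme0: Abort status: 0x0'

# First occurrence of the abort error:
printf '%s\n' "$log" | grep -m1 'Abort status'
# -> Mar 18 16:18:20 openqaworker1 kernel: nvme nvme0: Abort status: 0x0

# Occurrences per day (fields 1 and 2 are month and day):
printf '%s\n' "$log" | grep 'Abort status' \
    | awk '{ count[$1" "$2]++ } END { for (d in count) print d, count[d] }'
```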

#2 Updated by okurz over 1 year ago

Recreating the RAID and creating a filesystem reproduces the problem on nvme0:

Mar 20 16:57:37 openqaworker1 kernel: md127: detected capacity change from 799906201600 to 0
Mar 20 16:57:37 openqaworker1 kernel: md: md127 stopped.
Mar 20 16:57:48 openqaworker1 kernel:  nvme0n1: p1
Mar 20 16:57:48 openqaworker1 kernel:  nvme0n1: p1
Mar 20 16:57:48 openqaworker1 kernel: md127: detected capacity change from 0 to 799906201600
Mar 20 16:57:48 openqaworker1 kernel:  nvme1n1: p1
Mar 20 16:58:37 openqaworker1 kernel: nvme nvme0: I/O 0 QID 23 timeout, aborting
Mar 20 16:58:37 openqaworker1 kernel: nvme nvme0: I/O 1 QID 23 timeout, aborting
Mar 20 16:58:37 openqaworker1 kernel: nvme nvme0: I/O 2 QID 23 timeout, aborting
Mar 20 16:58:37 openqaworker1 kernel: nvme nvme0: I/O 3 QID 23 timeout, aborting
Mar 20 16:58:37 openqaworker1 kernel: nvme nvme0: Abort status: 0x0
Mar 20 16:58:42 openqaworker1 kernel: nvme nvme0: Abort status: 0x0
Mar 20 16:58:42 openqaworker1 kernel: nvme nvme0: Abort status: 0x0
Mar 20 16:58:42 openqaworker1 kernel: nvme nvme0: Abort status: 0x0

I had to reboot w1 as it was stuck on I/O. I could reproduce the NVMe problems by recreating the RAID0 used for /var/lib/openqa and trying to create a new filesystem on it. After the reboot I will try to create a RAID0 from just the single NVMe1 device and bring up a limited number of openQA worker instances which are also MM capable. As a second priority I will add multi-machine support to w4 and w7, see #64700, which I plan to do only later so as not to disrupt currently running jobs and not to end up with no workers running at all due to a misconfiguration.

w1 back up with 12 worker instances, single NVMe.

#3 Updated by okurz over 1 year ago

  • Status changed from In Progress to Blocked

Created https://infra.nue.suse.com/SelfService/Display.html?id=166270 asking for replacement of the broken device.

#4 Updated by okurz over 1 year ago

  • Status changed from Blocked to Feedback
  • Assignee changed from okurz to nicksinger
  • Priority changed from Urgent to High

The infra ticket was resolved as nicksinger stated that he will order a replacement, so I am assigning this ticket to him.

Reducing from "Urgent" to "High" as I brought the machine back up with the single remaining NVMe present.

nicksinger ETA?

#5 Updated by okurz over 1 year ago

pinged in RC

#6 Updated by nicksinger over 1 year ago

"REQ_402391: Order request: NVMe replacement for openqaworker1 and openqaworker7" is in "FULFILLMENT QUEUE" since 25/03/2020 09:57:39 AM. I've added a "note" asking if and how I can speed things up.

#7 Updated by okurz over 1 year ago

  • Due date set to 2020-06-01
  • Status changed from Feedback to Blocked

Fine. It's actually not that urgent as we can live with a single NVMe for some time; I just wanted to know what we can expect. So I am setting this to "Blocked" with a due date for you accordingly.

#8 Updated by nicksinger about 1 year ago

Since no progress is visible in the RIO ticket I escalated to Ralf.

#9 Updated by okurz about 1 year ago

What's the result of the escalation after 2 months? You should also have received reminder emails about this ticket after the due date passed. I am coming back to this topic because fvogt found the issue "again" in the logs.

#10 Updated by mgriessmeier about 1 year ago

The replacement devices were delivered in the meantime.

#11 Updated by okurz about 1 year ago

  • Target version set to Ready

#12 Updated by okurz 12 months ago

  • Tags changed from caching, openQA, sporadic, arm, ipmi, worker to sporadic, worker

#13 Updated by okurz 11 months ago

@nsinger are you receiving reminders about this ticket like twice a week? I guess you can also use the NVMe from runger here

#14 Updated by nicksinger 11 months ago

Update: we have two of these adapters and one M.2 SSD/NVMe. Gerhard has now taken one of the adapters, will plug in the M.2 and try to build it into either openqaworker1 or openqaworker7. Once we have feedback on whether the adapters fit the server chassis I will order a second M.2 (apparently the second one never arrived).

#15 Updated by nicksinger 11 months ago

#16 Updated by nicksinger 11 months ago

  • Status changed from Blocked to Feedback

Waiting for the hardware mentioned in https://progress.opensuse.org/issues/49694#note-22

#17 Updated by nicksinger 11 months ago

  • Status changed from Feedback to Workable

The NVMe was built into the machine today. You can check which block device belongs to which physical disk by comparing the PCI IDs from lspci with the device paths reported by udev:

openqaworker1:~ # lspci | grep Non-Volatile
81:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
82:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
83:00.0 Non-Volatile memory controller: Intel Corporation SSD 660P Series (rev 03)        #<- this is the new one
openqaworker1:~ # udevadm info -q all -n /dev/nvme2
P: /devices/pci0000:80/0000:80:02.0/0000:83:00.0/nvme/nvme2
N: nvme2
E: DEVNAME=/dev/nvme2
E: DEVPATH=/devices/pci0000:80/0000:80:02.0/0000:83:00.0/nvme/nvme2                       #<- "83:00.0" from lspci shows that this NVMe is available at /dev/nvme2
E: MAJOR=245
E: MINOR=2
E: SUBSYSTEM=nvme
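The same mapping can be read directly from sysfs: each entry under /sys/class/nvme is a symlink into the PCI device tree, so the last path component of its resolved target is the PCI address lspci reports. A sketch using a simulated sysfs layout (on the real host the loop would glob /sys/class/nvme/nvme* directly):

```shell
# Simulated sysfs tree for illustration; on openqaworker1 the real entries
# under /sys/class/nvme are symlinks like the DEVPATH shown above.
root=$(mktemp -d)
mkdir -p "$root/devices/pci0000:80/0000:80:02.0/0000:83:00.0"
mkdir -p "$root/class/nvme"
ln -s "$root/devices/pci0000:80/0000:80:02.0/0000:83:00.0" "$root/class/nvme/nvme2"

# Print "<controller> -> <PCI address>" for each NVMe controller:
for dev in "$root"/class/nvme/nvme*; do
    printf '%s -> %s\n' "${dev##*/}" "$(basename "$(readlink -f "$dev")")"
done
# -> nvme2 -> 0000:83:00.0
```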

#18 Updated by okurz 11 months ago

OK, but now the unused nvme0 is still in the system. Can you rip it out and discard it? Then we can rely on the same semi-automatic provisioning of the device as for the other workers. As an alternative we can manually add the second NVMe to the RAID0, forget that nvme0 is faulty, and eventually someone will run into the same problem with nvme0 again ;)

#19 Updated by okurz 10 months ago

nicksinger can you please describe what is the current state for this ticket, what are your plans, suggestions, things to do or things to wait for?

#20 Updated by nicksinger 10 months ago

okurz wrote:

nicksinger can you please describe what is the current state for this ticket, what are your plans, suggestions, things to do or things to wait for?
the SSD is built in now. What's left is making use of it and pulling out the old one. As we have two "old" ones in there we need some kind of coordination with Infra when they pull it to see if it's the right one

#21 Updated by okurz 10 months ago

nicksinger wrote:

the SSD is built in now. What's left is making use of it and pulling out the old one. As we have two "old" ones in there we need some kind of coordination with Infra when they pull it to see if it's the right one

Well, lsblk -o NAME,MODEL,SERIAL tells me:

NAME    MODEL                                    SERIAL
sda     HGST HTE721010A9                         
├─sda1                                           
├─sda2                                           
└─sda3                                           
  └─md1                                          
nvme2n1 INTEL SSDPEKNW010T8                      BTNH022600SN1P0B    
nvme1n1 INTEL SSDPE2ME400G4                      CVMD4352001W400FGN  
└─md127                                          
nvme0n1 INTEL SSDPE2ME400G4                      CVMD4352007S400FGN  

so can you ask your Infra contact person like gschlotter to rip out the device with serial id CVMD4352007S400FGN whenever the machine is not fully utilized? You could either reopen https://infra.nue.suse.com/SelfService/Display.html?id=176945 or create a new ticket.
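For the record, the device name for a given serial can be extracted from that lsblk output mechanically. A sketch over the table quoted above (on the host, `lsblk -lno NAME,SERIAL` would produce this two-column input directly, without the tree-drawing characters):

```shell
# NAME/SERIAL pairs as quoted above; on the worker this would be the output
# of `lsblk -lno NAME,SERIAL` restricted to the NVMe devices.
table='nvme2n1 BTNH022600SN1P0B
nvme1n1 CVMD4352001W400FGN
nvme0n1 CVMD4352007S400FGN'

# The serial of the device to be pulled, from the comment above:
serial=CVMD4352007S400FGN
printf '%s\n' "$table" | awk -v s="$serial" '$2 == s { print $1 }'
# -> nvme0n1
```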

#22 Updated by nicksinger 10 months ago

  • Assignee deleted (nicksinger)
  • Priority changed from High to Low

I doubt Infra can tell which tray a device is in if you give them just a serial number. I know some servers can blink a light at a specific tray slot but I don't know if that is supported here. In any case it should definitely be combined with the effort to remove the broken disk from worker7 (see https://progress.opensuse.org/issues/49694).

#23 Updated by okurz 10 months ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz
  • Priority changed from Low to High

Fine. But I would still like to have this handled with high priority. I will create the corresponding infra tickets then.

#24 Updated by okurz 10 months ago

  • Due date changed from 2020-06-01 to 2020-10-23
  • Status changed from In Progress to Blocked
  • Priority changed from High to Normal

#25 Updated by okurz 9 months ago

  • Status changed from Blocked to Feedback

The infra ticket was resolved and the machine powered on, but it did not come up properly, as reported by fvogt. I could not easily recover it myself due to a problem with ipmitool, but the version of ipmitool from "wotan" was OK. fvogt fixed it with https://paste.opensuse.org/view/raw/34762541, i.e.:

Issue:
error: symbol `grub_file_filters' not found.
grub rescue> 
-> Mismatch of installed core.img and modules

Idea:
In the rescue shell, set the prefix to the location of modules from an older snapshot

Procedure:
set btrfs_relative_path=n
ls (mduuid/958978b17edb5f3df27f5be3c639f19b)/@/.snapshots/
-> oldest is 755/
set btrfs_subvol=/@/.snapshots/755/snapshot
set btrfs_relative_path=y
set prefix=/usr/share/grub2/
insmod normal
normal
configfile /boot/grub2/grub.cfg
-> System boots

The system is up. We need to update the RAID config, but the system is currently working hard on a new Tumbleweed snapshot, so I can also do that at a different time :) What I did then:

mdadm --stop /dev/md/openqa
mdadm --create /dev/md/openqa --level=0 --force --raid-devices=2 --run /dev/nvme?n1
mdadm --detail --scan >> /etc/mdadm.conf
vim /etc/mdadm.conf
mkfs.ext2 /dev/md/openqa
mkdir -p /var/lib/openqa/{pool,cache,share} && /usr/bin/chown _openqa-worker:root /var/lib/openqa/cache /var/lib/openqa/pool /var/lib/openqa/share
systemctl start openqa-worker-cacheservice openqa-worker-cacheservice-minion openqa-worker.target
systemctl unmask openqa-worker@{13..20} && systemctl enable --now openqa-worker@{13..20}

including enabling more worker instances, as we now have 1.3 TB of free pool space and openqaworker4 also runs that many instances.

Next I checked the CACHELIMIT settings, to increase them here as well:

> for i in aarch64 openqaworker1 openqaworker4 openqaworker7 power8 imagetester rebel; do echo $i && ssh root@$i "grep -B 2 CACHELIMIT /etc/openqa/workers.ini" ; done
aarch64
# okurz: 2020-05-18: jobs incompleted with "No space left on device",
# reducing 200->160
CACHELIMIT = 160
openqaworker1
openqaworker4
WORKER_HOSTNAME=192.168.112.7
CACHEDIRECTORY = /var/lib/openqa/cache
CACHELIMIT = 50
openqaworker7
# cache had been fully used to 50G and free disk space was 200G so that we
# can we can proably increase this limit
CACHELIMIT = 180
power8
CACHEDIRECTORY = /var/lib/openqa/cache
# okurz: 2020-05-09: Increased cache limit from default 50G as we have enough space
CACHELIMIT = 600
imagetester
WORKER_HOSTNAME=192.168.112.5
CACHEDIRECTORY = /var/lib/openqa/cache
CACHELIMIT = 50
rebel

I have increased the cache limit to 400 GB, with a comment in /etc/openqa/workers.ini. Also informed users in irc://chat.freenode.net/opensuse-factory and will monitor.
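The resulting fragment in /etc/openqa/workers.ini on openqaworker1 would look roughly like this (a sketch; the openQA worker reads the cache keys from the [global] section, and the comment wording here is illustrative, not the literal file content):

```ini
[global]
CACHEDIRECTORY = /var/lib/openqa/cache
# increased from the default 50G after the NVMe replacement;
# ~1.3T of pool space is now available
CACHELIMIT = 400
```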

#26 Updated by okurz 9 months ago

  • Status changed from Feedback to Resolved

There was one remaining problem: I had enabled more worker instances, but they did not have a tap device assigned. fvogt resolved that in another ticket by changing /etc/openqa/workers.ini. Currently I see /dev/md127 1.3T 526G 715G 43% /var/lib/openqa and load average: 9.74, 12.20, 12.90. I think everything is OK here.
