action #64685

openqaworker1 showing NVMe problems "kernel: nvme nvme0: Abort status: 0x0"

Added by okurz over 1 year ago. Updated 9 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date: 2020-03-20
Due date: 2020-10-23
% Done: 0%
Estimated time:

Description

Observation

[20/03/2020 11:39:23] <DimStar> Martchus_: any idea what's up here? https://openqa.opensuse.org/tests/overview?arch=&machine=&modules=&todo=1&distri=microos&distri=opensuse&version=Tumbleweed&build=20200318&groupid=1# ?
[20/03/2020 11:52:02] <guillaume_g> Defolos: fyi, https://bugzilla.opensuse.org/show_bug.cgi?id=1167232
[20/03/2020 11:52:05] <|Anna|> openSUSE bug 1167232 in openSUSE Tumbleweed "Vagrant Tumbleweed 20200317 fails due to unsupported configuration PS2" [Normal, New]
[20/03/2020 11:53:02] <DimStar> okurz: the nvme issues we had last time around was ow4, right?

On w1 in system journal:

nvme nvme0: Abort status: 0x0

Related issues

Related to openQA Infrastructure - action #49694: openqaworker7 lost one NVMe (Resolved, 2019-03-26)

History

#1 Updated by okurz over 1 year ago

According to journalctl | grep 'kernel: nvme nvme0: Abort status: 0x0' the first error already appeared on

Mar 18 16:18:20 openqaworker1 kernel: nvme nvme0: Abort status: 0x0

I can try to recreate the filesystem, but I suspect the device itself is broken. On second thought, maybe ext2 is not a good choice after all?
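To gauge how long the device has been failing, the journal can be scanned for the first occurrence and a per-day count of the error. A minimal sketch parsing journal-style lines (here fed from a sample string quoting this ticket; on the worker the input would come from `journalctl -k` instead):

```shell
# Sample kernel-log lines as quoted in this ticket; on openqaworker1 the
# input would be `journalctl -k`, not this variable.
log='Mar 18 16:18:20 openqaworker1 kernel: nvme nvme0: Abort status: 0x0
Mar 20 16:58:37 openqaworker1 kernel: nvme nvme0: I/O 0 QID 23 timeout, aborting
Mar 20 16:58:42 openqaworker1 kernel: nvme nvme0: Abort status: 0x0'

# First occurrence of the abort error:
printf '%s\n' "$log" | grep -m1 'Abort status'
# -> Mar 18 16:18:20 openqaworker1 kernel: nvme nvme0: Abort status: 0x0

# Occurrences per day (fields 1 and 2 are month and day):
printf '%s\n' "$log" | grep 'Abort status' \
    | awk '{ count[$1" "$2]++ } END { for (d in count) print d, count[d] }'
```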

#2 Updated by okurz over 1 year ago

Recreating the RAID and creating a filesystem reproduces the problem on nvme0:

Mar 20 16:57:37 openqaworker1 kernel: md127: detected capacity change from 799906201600 to 0
Mar 20 16:57:37 openqaworker1 kernel: md: md127 stopped.
Mar 20 16:57:48 openqaworker1 kernel:  nvme0n1: p1
Mar 20 16:57:48 openqaworker1 kernel:  nvme0n1: p1
Mar 20 16:57:48 openqaworker1 kernel: md127: detected capacity change from 0 to 799906201600
Mar 20 16:57:48 openqaworker1 kernel:  nvme1n1: p1
Mar 20 16:58:37 openqaworker1 kernel: nvme nvme0: I/O 0 QID 23 timeout, aborting
Mar 20 16:58:37 openqaworker1 kernel: nvme nvme0: I/O 1 QID 23 timeout, aborting
Mar 20 16:58:37 openqaworker1 kernel: nvme nvme0: I/O 2 QID 23 timeout, aborting
Mar 20 16:58:37 openqaworker1 kernel: nvme nvme0: I/O 3 QID 23 timeout, aborting
Mar 20 16:58:37 openqaworker1 kernel: nvme nvme0: Abort status: 0x0
Mar 20 16:58:42 openqaworker1 kernel: nvme nvme0: Abort status: 0x0
Mar 20 16:58:42 openqaworker1 kernel: nvme nvme0: Abort status: 0x0
Mar 20 16:58:42 openqaworker1 kernel: nvme nvme0: Abort status: 0x0

I had to reboot w1 as it was stuck on I/O. I could reproduce the NVMe problems by recreating the RAID0 used for /var/lib/openqa and trying to create a new filesystem on it. After the reboot I will try to create a RAID0 from just the single NVMe1 device and bring up a limited number of openQA worker instances which are also MM capable. As a second priority I will add multi-machine support to w4 and w7, see #64700, which I plan to do only later so as not to disrupt currently running jobs and not to end up with no workers running at all due to a misconfiguration.

w1 back up with 12 worker instances, single NVMe.

#3 Updated by okurz over 1 year ago

  • Status changed from In Progress to Blocked

Created https://infra.nue.suse.com/SelfService/Display.html?id=166270 asking for replacement of the broken device.

#4 Updated by okurz over 1 year ago

  • Status changed from Blocked to Feedback
  • Assignee changed from okurz to nicksinger
  • Priority changed from Urgent to High

The infra ticket was resolved as nicksinger stated that he will order a replacement, so I am assigning this ticket to him.

Reducing from "Urgent" to "High" as I brought the machine back up with the single remaining NVMe present.

nicksinger ETA?

#5 Updated by okurz over 1 year ago

pinged in RC

#6 Updated by nicksinger over 1 year ago

"REQ_402391: Order request: NVMe replacement for openqaworker1 and openqaworker7" is in "FULFILLMENT QUEUE" since 25/03/2020 09:57:39 AM. I've added a "note" asking if and how I can speed things up.

#7 Updated by okurz over 1 year ago

  • Due date set to 2020-06-01
  • Status changed from Feedback to Blocked

Fine. It's actually not that urgent as we can live with a single NVMe for some time; I just wanted to know what we can expect. So I am setting this to "Blocked" with a due date for you accordingly.

#8 Updated by nicksinger about 1 year ago

Since no progress is visible in the RIO ticket I escalated to Ralf.

#9 Updated by okurz about 1 year ago

What's the result of the escalation after 2 months? You should also have received reminder emails about this ticket after the due date passed. I am coming back to this topic because fvogt found the issue "again" in the logs.

#10 Updated by mgriessmeier about 1 year ago

The replacement devices were delivered in the meantime.

#11 Updated by okurz about 1 year ago

  • Target version set to Ready

#12 Updated by okurz 12 months ago

  • Tags changed from caching, openQA, sporadic, arm, ipmi, worker to sporadic, worker

#13 Updated by okurz 11 months ago

@nsinger are you receiving reminders about this ticket like twice a week? I guess you can also use the NVMe from runger here

#14 Updated by nicksinger 11 months ago

Update: we have two of these adapters and one M.2 SSD/NVMe. Gerhard has now taken one of the adapters, will plug in the M.2 and try to build it into either openqaworker1 or openqaworker7. Once we have feedback on whether the adapters fit the server chassis I will order a second M.2 (apparently the second one never arrived).

#15 Updated by nicksinger 11 months ago

#16 Updated by nicksinger 11 months ago

  • Status changed from Blocked to Feedback

Waiting for the hardware mentioned in https://progress.opensuse.org/issues/49694#note-22

#17 Updated by nicksinger 11 months ago

  • Status changed from Feedback to Workable

The NVMe was built into the machine today. You can check which block device belongs to which physical disk by comparing the PCI IDs from lspci with the device paths reported by udev:

openqaworker1:~ # lspci | grep Non-Volatile
81:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
82:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
83:00.0 Non-Volatile memory controller: Intel Corporation SSD 660P Series (rev 03)        #<- this is the new one
openqaworker1:~ # udevadm info -q all -n /dev/nvme2
P: /devices/pci0000:80/0000:80:02.0/0000:83:00.0/nvme/nvme2
N: nvme2
E: DEVNAME=/dev/nvme2
E: DEVPATH=/devices/pci0000:80/0000:80:02.0/0000:83:00.0/nvme/nvme2                       #<- "83:00.0" from lspci shows that this NVMe is available at /dev/nvme2
E: MAJOR=245
E: MINOR=2
E: SUBSYSTEM=nvme
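The same mapping can be read directly from sysfs: each entry under /sys/class/nvme is a symlink into the PCI device tree, so the last path component of its resolved target is the PCI address lspci reports. A sketch using a simulated sysfs layout (on the real host the loop would glob /sys/class/nvme/nvme* directly):

```shell
# Simulated sysfs tree for illustration; on openqaworker1 the real entries
# under /sys/class/nvme are symlinks like the DEVPATH shown above.
root=$(mktemp -d)
mkdir -p "$root/devices/pci0000:80/0000:80:02.0/0000:83:00.0"
mkdir -p "$root/class/nvme"
ln -s "$root/devices/pci0000:80/0000:80:02.0/0000:83:00.0" "$root/class/nvme/nvme2"

# Print "<controller> -> <PCI address>" for each NVMe controller:
for dev in "$root"/class/nvme/nvme*; do
    printf '%s -> %s\n' "${dev##*/}" "$(basename "$(readlink -f "$dev")")"
done
# -> nvme2 -> 0000:83:00.0
```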

#18 Updated by okurz 11 months ago

OK, but now the unused nvme0 is still in the system. Can you rip it out and discard it? Then we can rely on the same semi-automatic provisioning of the device as for the other workers. As an alternative we can manually add the second NVMe to the RAID0, forget that nvme0 is faulty, and eventually someone will run into the same problem with nvme0 again ;)

#19 Updated by okurz 10 months ago

nicksinger can you please describe what is the current state for this ticket, what are your plans, suggestions, things to do or things to wait for?

#20 Updated by nicksinger 10 months ago

okurz wrote:

nicksinger can you please describe what is the current state for this ticket, what are your plans, suggestions, things to do or things to wait for?
the SSD is built in now. What's left is making use of it and pulling out the old one. As we have two "old" ones in there we need some kind of coordination with Infra when they pull it to see if it's the right one

#21 Updated by okurz 10 months ago

nicksinger wrote:

the SSD is built in now. What's left is making use of it and pulling out the old one. As we have two "old" ones in there we need some kind of coordination with Infra when they pull it to see if it's the right one

Well, lsblk -o NAME,MODEL,SERIAL tells me:

NAME    MODEL                                    SERIAL
sda     HGST HTE721010A9                         
├─sda1                                           
├─sda2                                           
└─sda3                                           
  └─md1                                          
nvme2n1 INTEL SSDPEKNW010T8                      BTNH022600SN1P0B    
nvme1n1 INTEL SSDPE2ME400G4                      CVMD4352001W400FGN  
└─md127                                          
nvme0n1 INTEL SSDPE2ME400G4                      CVMD4352007S400FGN  

so can you ask your Infra contact person like gschlotter to rip out the device with serial id CVMD4352007S400FGN whenever the machine is not fully utilized? You could either reopen https://infra.nue.suse.com/SelfService/Display.html?id=176945 or create a new ticket.
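For the record, the device name for a given serial can be extracted from that lsblk output mechanically. A sketch over the table quoted above (on the host, `lsblk -lno NAME,SERIAL` would produce this two-column input directly, without the tree-drawing characters):

```shell
# NAME/SERIAL pairs as quoted above; on the worker this would be the output
# of `lsblk -lno NAME,SERIAL` restricted to the NVMe devices.
table='nvme2n1 BTNH022600SN1P0B
nvme1n1 CVMD4352001W400FGN
nvme0n1 CVMD4352007S400FGN'

# The serial of the device to be pulled, from the comment above:
serial=CVMD4352007S400FGN
printf '%s\n' "$table" | awk -v s="$serial" '$2 == s { print $1 }'
# -> nvme0n1
```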

#22 Updated by nicksinger 10 months ago

  • Assignee deleted (nicksinger)
  • Priority changed from High to Low

I doubt Infra can tell which tray a device is in if you give them just a serial number. I know some servers can blink a light at a specific tray slot but I don't know if that is supported here. In any case it should definitely be combined with the effort to remove the broken disk from worker7 (see https://progress.opensuse.org/issues/49694).

#23 Updated by okurz 10 months ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz
  • Priority changed from Low to High

Fine. But I would still like to have this handled with high priority. I will create the corresponding infra tickets then.

#24 Updated by okurz 10 months ago

  • Due date changed from 2020-06-01 to 2020-10-23
  • Status changed from In Progress to Blocked
  • Priority changed from High to Normal

#25 Updated by okurz 9 months ago

  • Status changed from Blocked to Feedback

The infra ticket was resolved and the machine powered on, but it did not come up properly, as reported by fvogt. I could not easily recover it myself due to a problem with ipmitool, but the version of ipmitool from "wotan" was OK. fvogt fixed it with https://paste.opensuse.org/view/raw/34762541, i.e.:

Issue:
error: symbol `grub_file_filters' not found.
grub rescue> 
-> Mismatch of installed core.img and modules

Idea:
In the rescue shell, set the prefix to the location of modules from an older snapshot

Procedure:
set btrfs_relative_path=n
ls (mduuid/958978b17edb5f3df27f5be3c639f19b)/@/.snapshots/
-> oldest is 755/
set btrfs_subvol=/@/.snapshots/755/snapshot
set btrfs_relative_path=y
set prefix=/usr/share/grub2/
insmod normal
normal
configfile /boot/grub2/grub.cfg
-> System boots

The system is up. We need to update the RAID config, but the system is currently working hard on a new Tumbleweed snapshot, so I can also do that at a different time :) What I did then:

mdadm --stop /dev/md/openqa
mdadm --create /dev/md/openqa --level=0 --force --raid-devices=2 --run /dev/nvme?n1
mdadm --detail --scan >> /etc/mdadm.conf
vim /etc/mdadm.conf
mkfs.ext2 /dev/md/openqa
mkdir -p /var/lib/openqa/{pool,cache,share} && /usr/bin/chown _openqa-worker:root /var/lib/openqa/cache /var/lib/openqa/pool /var/lib/openqa/share
systemctl start openqa-worker-cacheservice openqa-worker-cacheservice-minion openqa-worker.target
systemctl unmask openqa-worker@{13..20} && systemctl enable --now openqa-worker@{13..20}

including enabling more worker instances, as we now have 1.3 TB of free pool space and openqaworker4 also runs that many instances.

Next I checked the CACHELIMIT settings, to increase them here as well:

> for i in aarch64 openqaworker1 openqaworker4 openqaworker7 power8 imagetester rebel; do echo $i && ssh root@$i "grep -B 2 CACHELIMIT /etc/openqa/workers.ini" ; done
aarch64
# okurz: 2020-05-18: jobs incompleted with "No space left on device",
# reducing 200->160
CACHELIMIT = 160
openqaworker1
openqaworker4
WORKER_HOSTNAME=192.168.112.7
CACHEDIRECTORY = /var/lib/openqa/cache
CACHELIMIT = 50
openqaworker7
# cache had been fully used to 50G and free disk space was 200G so that we
# can we can proably increase this limit
CACHELIMIT = 180
power8
CACHEDIRECTORY = /var/lib/openqa/cache
# okurz: 2020-05-09: Increased cache limit from default 50G as we have enough space
CACHELIMIT = 600
imagetester
WORKER_HOSTNAME=192.168.112.5
CACHEDIRECTORY = /var/lib/openqa/cache
CACHELIMIT = 50
rebel

I have increased the cache limit to 400 GB, with a comment in /etc/openqa/workers.ini. Also informed users in irc://chat.freenode.net/opensuse-factory and will monitor.
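The resulting fragment in /etc/openqa/workers.ini on openqaworker1 would look roughly like this (a sketch; the openQA worker reads the cache keys from the [global] section, and the comment wording here is illustrative, not the literal file content):

```ini
[global]
CACHEDIRECTORY = /var/lib/openqa/cache
# increased from the default 50G after the NVMe replacement;
# ~1.3T of pool space is now available
CACHELIMIT = 400
```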

#26 Updated by okurz 9 months ago

  • Status changed from Feedback to Resolved

There was one remaining problem: I had enabled more worker instances, but they did not have a tap device assigned. fvogt resolved that in another ticket by changing /etc/openqa/workers.ini. Currently I see /dev/md127 1.3T 526G 715G 43% /var/lib/openqa and load average: 9.74, 12.20, 12.90. I think everything is OK here.
