action #62849

broken NVMe on openqaworker4 auto_review:"No space left on device"

Added by okurz 4 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Start date:
2020-01-31
Due date:
2020-03-10
% Done:

0%

Estimated time:
Duration: 28

Description

Observation

from [#opensuse-factory](irc://chat.freenode.net/opensuse-factory) :

[30/01/2020 14:02:47] <DimStar> do we have some openQA network performanc issues? I see tons of tests, like on await_install for example (https://openqa.opensuse.org/tests/1159380#step/await_install/11 - normally the entire test for creating the HDD runs like 45 minutes)
[30/01/2020 14:03:12] <DimStar> hm. that should not even be net related - that's installing from the DVD
[30/01/2020 14:03:58] <DimStar> ow4: top - 14:03:52 up 10:30,  1 user,  load average: 50.14, 106.59, 145.35
[30/01/2020 14:04:09] <DimStar> that's a good load :)
…
[30/01/2020 14:21:18] <DimStar> Martchus: https://openqa.opensuse.org/tests/overview?arch=&machine=&modules=&todo=1&distri=microos&distri=opensuse&version=Tumbleweed&build=20200129&groupid=1# - pretty much all those failures are 'performance related' - and all happened on OW4

The system journal on openqaworker4 goes back to 2019-11-20, but NVMe timeouts only show up since 2020-01-30 in the output of sudo journalctl | grep 'nvme.*timeout' | less

Jan 30 06:22:18 openqaworker4 kernel: nvme nvme0: I/O 814 QID 22 timeout, aborting
Jan 30 06:22:18 openqaworker4 kernel: nvme nvme0: I/O 816 QID 22 timeout, aborting
Jan 30 06:22:18 openqaworker4 kernel: nvme nvme0: I/O 818 QID 22 timeout, aborting
…
Jan 31 09:40:17 openqaworker4 kernel: nvme nvme0: I/O 242 QID 1 timeout, aborting
Jan 31 09:40:17 openqaworker4 kernel: nvme nvme0: I/O 245 QID 1 timeout, aborting
Jan 31 09:40:17 openqaworker4 kernel: nvme nvme0: I/O 247 QID 1 timeout, aborting

So the problem persisted across the nightly upgrade.
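
To see when the timeouts started and how they are distributed, the journal lines can be counted per day and device. This is only a minimal sketch: the function name count_nvme_timeouts is made up for illustration, and it assumes the default journalctl short output format (month and day in the first two fields), as in the excerpt above.

```shell
# Count NVMe I/O timeout messages per day and device.
# Reads journal lines on stdin, prints "<month> <day> <device> <count>".
# Assumption: default journalctl "short" format, e.g.
#   Jan 30 06:22:18 host kernel: nvme nvme0: I/O 814 QID 22 timeout, aborting
count_nvme_timeouts() {
    awk '
        /nvme[0-9]+: I\/O [0-9]+ QID [0-9]+ timeout/ {
            # Extract the device name (e.g. "nvme0") without the trailing colon
            match($0, /nvme[0-9]+:/)
            dev = substr($0, RSTART, RLENGTH - 1)
            count[$1 " " $2 " " dev]++
        }
        END { for (key in count) print key, count[key] }
    ' | sort
}

# Possible usage on the worker:
#   sudo journalctl --no-pager | count_nvme_timeouts
```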


Related issues

Related to openQA Infrastructure - action #62162: Move one openqa worker machine from osd to o3 - Resolved - 2020-01-15

History

#1 Updated by okurz 4 months ago

As a potential short-term remedy I stopped some openQA worker instances on the machine: systemctl disable --now openqa-worker@{11..16}. With less load, other tests may have a better chance of finishing, but I doubt it helps.

#3 Updated by okurz 4 months ago

  • Status changed from New to Blocked
  • Assignee set to okurz

I stopped all openQA worker instances on openqaworker4 as it seems it can not successfully finish any test job. I have raised a ticket with SUSE Engineering Infrastructure and I will also look into moving a machine from the OSD pool to o3.

#4 Updated by mkittler 4 months ago

Yesterday I ran smartctl on nvme0. It didn't log any errors and I couldn't run self-tests, but the overall health check reports: "SMART overall-health self-assessment test result: FAILED! - media has been placed in read only mode"

We also noticed that even a single touch can take several seconds so the file system is seriously slow.
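
The slowness can be made measurable with a small probe. A rough sketch, assuming GNU date (for nanosecond timestamps via %N); the function name time_touch_ms and the probe file name are hypothetical:

```shell
# Print how many milliseconds a single "touch" takes in the given directory.
# Assumption: GNU date, which supports %N (nanoseconds).
time_touch_ms() {
    local dir=${1:-.}
    local probe="$dir/.touch_probe_$$"
    local start end
    start=$(date +%s%N)
    touch "$probe"
    end=$(date +%s%N)
    rm -f "$probe"
    echo $(( (end - start) / 1000000 ))
}

# Possible usage on the worker:
#   time_touch_ms /var/lib/openqa/pool
```

On a healthy file system this should print a value of a few milliseconds at most; "several seconds" would show up as values in the thousands.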

#5 Updated by okurz 4 months ago

  • Related to action #62162: Move one openqa worker machine from osd to o3 added

#6 Updated by okurz 4 months ago

Update in https://infra.nue.suse.com/SelfService/Display.html?id=162199 : "I did something", which I assume means a manual check of the physical cable connections. I doubt this will help. Nevertheless, I set up the partitions according to #19238 , i.e. put all (both) NVMe devices into a RAID0 used as shared cache+pool, and set up the openQA worker instances with a new worker class "qemu_x86_64_poo62849". However, already on mkfs.ext2 /dev/md2 we get in dmesg:

[Mon Feb  3 16:05:47 2020] nvme nvme0: I/O 660 QID 21 timeout, aborting
[Mon Feb  3 16:05:47 2020] nvme nvme0: I/O 662 QID 21 timeout, aborting
[Mon Feb  3 16:05:47 2020] nvme nvme0: I/O 664 QID 21 timeout, aborting
[Mon Feb  3 16:05:47 2020] nvme nvme0: I/O 666 QID 21 timeout, aborting
[Mon Feb  3 16:05:47 2020] nvme nvme0: Abort status: 0x0
[Mon Feb  3 16:05:47 2020] nvme nvme0: I/O 802 QID 21 timeout, aborting
[Mon Feb  3 16:05:52 2020] nvme nvme0: Abort status: 0x0
[Mon Feb  3 16:05:52 2020] nvme nvme0: Abort status: 0x0
[Mon Feb  3 16:05:52 2020] nvme nvme0: Abort status: 0x0
[Mon Feb  3 16:05:52 2020] nvme nvme0: Abort status: 0x0

so nvme0 can still be considered broken. But with the changed setup it is now easy to exclude nvme0 and just use nvme1 for the time being:

mdadm --manage /dev/md2 --fail /dev/nvme0n1 && mdadm --manage /dev/md2 --remove /dev/nvme0n1

But first we need to reboot to be able to unblock the device. Did that and set up the RAID again. Also, the SMART data shows that nvme0 is reported as broken while nvme1 passes:

openqaworker4:~ # smartctl -a /dev/nvme1
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.12.14-lp151.28.36-default] (SUSE RPM)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       INTEL SSDPE2ME400G4
Serial Number:                      PHMD5486006J400FGN
Firmware Version:                   8DV10171
PCI Vendor/Subsystem ID:            0x8086
IEEE OUI Identifier:                0x5cd2e4
Controller ID:                      0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          400,088,457,216 [400 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Mon Feb  3 16:25:18 2020 CET
Firmware Updates (0x02):            1 Slot
Optional Admin Commands (0x0006):   Format Frmw_DL
Optional NVM Commands (0x0006):     Wr_Unc DS_Mngmt
Maximum Data Transfer Size:         32 Pages

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    25.00W       -        -    0  0  0  0        0       0

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -     512       8         2
 2 -     512      16         2
 3 -    4096       0         0
 4 -    4096       8         0
 5 -    4096      64         0
 6 -    4096     128         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:                   0x00
Temperature:                        18 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    1%
Data Units Read:                    127,214,186 [65.1 TB]
Data Units Written:                 138,042,960 [70.6 TB]
Host Read Commands:                 826,913,364
Host Write Commands:                567,132,398
Controller Busy Time:               1,362
Power Cycles:                       62
Power On Hours:                     34,139
Unsafe Shutdowns:                   48
Media and Data Integrity Errors:    0
Error Information Log Entries:      0

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged

openqaworker4:~ # smartctl -a /dev/nvme0
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.12.14-lp151.28.36-default] (SUSE RPM)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       INTEL SSDPE2ME400G4
Serial Number:                      PHMD5486006N400FGN
Firmware Version:                   8DV10171
PCI Vendor/Subsystem ID:            0x8086
IEEE OUI Identifier:                0x5cd2e4
Controller ID:                      0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          400,088,457,216 [400 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Mon Feb  3 16:25:27 2020 CET
Firmware Updates (0x02):            1 Slot
Optional Admin Commands (0x0006):   Format Frmw_DL
Optional NVM Commands (0x0006):     Wr_Unc DS_Mngmt
Maximum Data Transfer Size:         32 Pages

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    25.00W       -        -    0  0  0  0        0       0

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -     512       8         2
 2 -     512      16         2
 3 -    4096       0         0
 4 -    4096       8         0
 5 -    4096      64         0
 6 -    4096     128         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- media has been placed in read only mode

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:                   0x08
Temperature:                        19 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    105%
Data Units Read:                    638,301,999 [326 TB]
Data Units Written:                 8,531,815,272 [4.36 PB]
Host Read Commands:                 4,637,747,641
Host Write Commands:                40,016,781,281
Controller Busy Time:               206,712
Power Cycles:                       60
Power On Hours:                     34,148
Unsafe Shutdowns:                   49
Media and Data Integrity Errors:    0
Error Information Log Entries:      0

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
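
The relevant differences between the two devices are the health result, "Critical Warning" and "Percentage Used". The following sketch reduces a smartctl -a report to these three fields. The function names are made up for illustration. In the NVMe SMART/Health log, bit 3 (0x08) of the critical warning means the media has been placed in read-only mode, which matches the message smartctl prints for nvme0.

```shell
# Summarize the key health fields of "smartctl -a" output for an NVMe device.
# Reads the report on stdin and prints one line, e.g.
#   health=PASSED warning=0x00 used=1%
nvme_smart_summary() {
    awk -F': *' '
        /^SMART overall-health/ { health = $2 }
        /^Critical Warning:/    { warning = $2 }
        /^Percentage Used:/     { used = $2 }
        END { printf "health=%s warning=%s used=%s\n", health, warning, used }
    '
}

# Succeed if the given Critical Warning value has bit 3 (0x08) set,
# i.e. the media has been placed in read-only mode.
nvme_read_only() {
    [ $(( $1 & 0x08 )) -ne 0 ]
}

# Possible usage on the worker:
#   smartctl -a /dev/nvme0 | nvme_smart_summary
```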

Scheduled some test jobs:

for i in {1..20}; do openqa-clone-job --within-instance https://openqa.opensuse.org 1162816 WORKER_CLASS=qemu_x86_64_poo62849 _GROUP=0 BUILD=poo62849 TEST=create_hdd_minimalx_poo62849_$i; done

Created job #1163550: opensuse-Tumbleweed-NET-x86_64-Build20200201-create_hdd_minimalx@64bit -> https://openqa.opensuse.org/t1163550

build
https://openqa.opensuse.org/tests/overview?distri=opensuse&build=poo62849&version=Tumbleweed
shows 20/20 passed.

The machine did not come up after reboot because the file /etc/fstab.sys still listed the /var/lib/openqa/pool and /var/lib/openqa/cache mount points. I disabled the two lines in that file and ran transactional-update initrd && reboot. The machine is now back up with only the second NVMe in use.

#7 Updated by okurz 4 months ago

  • Subject changed from broken NVMe on openqaworker4 to broken NVMe on openqaworker4 auto_review:"No space left on device"

[05/02/2020 16:34:41] <DimStar> okurz[m]: any experiments happening with ow4?
[05/02/2020 16:35:36] <okurz> DimStar: not really, what's up?
[05/02/2020 16:35:45] <DimStar> okurz: https://openqa.opensuse.org/tests/overview?arch=&machine=&modules=&todo=1&distri=microos&distri=opensuse&version=Tumbleweed&build=20200204&groupid=1#
[05/02/2020 16:35:51] <DimStar> all incompletes are from OW4
[05/02/2020 16:40:11] <okurz> DimStar: not sure yet but I will shut off the openQA worker instances
[05/02/2020 16:41:39] <DimStar> okurz: k; seems it worked sort of ok for a day or two, but now starts to mess up again.. not sure what the ratio between failed and passed jobs there is currently on ow4
[05/02/2020 16:41:45] <okurz> DimStar: I see now, out of space :( Problem is that one NVMe is not enough for 16 worker instances, I guess if we run less it can work
[05/02/2020 16:42:47] <DimStar> let's run with 8? still better than missing the entire box
[05/02/2020 16:43:35] <okurz> yeah, I guess
[05/02/2020 16:44:06] <okurz> retriggered incompletes
[05/02/2020 16:44:10] <DimStar> thanks!

Incompletes on openqaworker4 again, e.g. https://openqa.opensuse.org/tests/1164661/file/autoinst-log.txt showing "No space left on device".

Disabled all 16 worker instances, restarted incompletes with env worker=openqaworker4 ~/local/os-autoinst/scripts/openqa-restart-incompletes-on-worker-instance and will bring up 8 again.

EDIT: OK, it seems I had missed something else instead: /etc/openqa/workers.ini stated "CACHELIMIT = 300", so we would fill up the disk with cache, not pool. Reduced it to the default of 50 and restarted the cache service and 8 worker instances.
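
A rough calculation shows why the cache limit mattered more than the number of instances. This is only a sketch under assumptions: CACHELIMIT is taken to be in GB, the single NVMe is counted with its nominal 400 GB, and file system overhead is ignored. The function name pool_gb_per_instance is made up for illustration.

```shell
# Estimate the pool space (in GB, rounded down) left per worker instance
# when cache and pool share one disk.
# Arguments: disk size in GB, cache limit in GB, number of worker instances.
pool_gb_per_instance() {
    local disk_gb=$1 cache_gb=$2 instances=$3
    echo $(( (disk_gb - cache_gb) / instances ))
}

pool_gb_per_instance 400 300 16   # CACHELIMIT = 300, 16 instances: prints 6
pool_gb_per_instance 400 50 8     # CACHELIMIT = 50, 8 instances: prints 43
```

With only about 6 GB of pool per instance, running out of space is plausible, while the default limit leaves roughly 43 GB per instance at 8 instances.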

EDIT: 2020-02-06: After tests looked fine so far I also enabled instances 9..12
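
For enabling or disabling a range of instances from a script, where brace expansion like {9..12} does not work with variables, a small helper can generate the unit names. A minimal sketch; the function name worker_units is hypothetical:

```shell
# Print the systemd unit names for a range of openQA worker instances,
# one per line.
# Arguments: first and last instance number (inclusive).
worker_units() {
    local first=$1 last=$2 i
    for i in $(seq "$first" "$last"); do
        printf 'openqa-worker@%s\n' "$i"
    done
}

# Possible usage on the worker:
#   systemctl enable --now $(worker_units 9 12)
```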

#8 Updated by okurz 3 months ago

gschlotter informed me yesterday that he will take care of a "service request" for openqaworker4 in the next few days.

#9 Updated by okurz 3 months ago

  • Priority changed from Urgent to Normal

We brought openqaworker7 into the o3 infrastructure so that the reduced capacity of openqaworker4 is less of a problem.

#10 Updated by okurz 3 months ago

  • Status changed from Blocked to Feedback

https://infra.nue.suse.com/SelfService/Display.html?id=162199 was resolved after openqaworker4 received a new, second NVMe. I can already see it in the system but will postpone the update of the worker config to a time when fewer tests are running.

#11 Updated by okurz 3 months ago

  • Due date set to 2020-03-10

Re-created the RAID0 across both NVMe devices and re-enabled all worker instances:

systemctl stop openqa-worker.target openqa-worker@\* && umount /var/lib/openqa/share && umount /var/lib/openqa
mdadm --stop /dev/md/openqa
mdadm --create /dev/md/openqa --level=0 --force --raid-devices=2 --run /dev/nvme?n1
mdadm --detail --scan >> /etc/mdadm.conf 
# delete duplicate entries
vim /etc/mdadm.conf
mkfs.ext2 /dev/md/openqa
mount /var/lib/openqa
mkdir -p /var/lib/openqa/{pool,cache,share} && /usr/bin/chown _openqa-worker:root /var/lib/openqa/{pool,cache,share}
mount -a
systemctl unmask openqa-worker-cacheservice openqa-worker-cacheservice-minion openqa-worker.target openqa-worker@{1..20}
systemctl enable --now openqa-worker-cacheservice openqa-worker-cacheservice-minion openqa-worker.target openqa-worker@{1..20}

All up and running.
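
The manual "delete duplicate entries" step could also be done non-interactively. Since a re-created array gets a new UUID, the stale entry is not an exact duplicate line, so the sketch below keeps only the last ARRAY entry per array device and leaves all other lines untouched. It assumes the most recently appended entry is the valid one. The function name dedup_mdadm_arrays is made up for illustration.

```shell
# Keep only the last ARRAY entry per array device in mdadm.conf content.
# Reads the config on stdin, writes the cleaned config to stdout.
# Assumption: the entry appended last (by "mdadm --detail --scan >>") is current.
dedup_mdadm_arrays() {
    awk '
        {
            line[NR] = $0
            key[NR] = ($1 == "ARRAY") ? $2 : ""
            if ($1 == "ARRAY") last[$2] = NR
        }
        END {
            for (i = 1; i <= NR; i++)
                if (key[i] == "" || last[key[i]] == i) print line[i]
        }
    '
}

# Possible usage (review the result before replacing the real file):
#   dedup_mdadm_arrays < /etc/mdadm.conf > /tmp/mdadm.conf.new
```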

#12 Updated by okurz 3 months ago

  • Status changed from Feedback to Resolved

Realized that /etc/openqa/workers.ini was not configuring all instances correctly. Changed it to use the generic WORKER_CLASS and to define special classes only for instances 7 and 8, which were previously also handled in a special manner, apparently for no good reason. Now instances 1..20 are up and show up fine on https://openqa.opensuse.org/admin/workers . I have checked the history of all worker instances and they all look OK.
