action #62849
closed: broken NVMe on openqaworker4 auto_review:"No space left on device"
Description
Observation
From #opensuse-factory:
[30/01/2020 14:02:47] <DimStar> do we have some openQA network performance issues? I see tons of tests, like on await_install for example (https://openqa.opensuse.org/tests/1159380#step/await_install/11 - normally the entire test for creating the HDD runs like 45 minutes)
[30/01/2020 14:03:12] <DimStar> hm. that should not even be net related - that's installing from the DVD
[30/01/2020 14:03:58] <DimStar> ow4: top - 14:03:52 up 10:30, 1 user, load average: 50.14, 106.59, 145.35
[30/01/2020 14:04:09] <DimStar> that's a good load :)
…
[30/01/2020 14:21:18] <DimStar> Martchus: https://openqa.opensuse.org/tests/overview?arch=&machine=&modules=&todo=1&distri=microos&distri=opensuse&version=Tumbleweed&build=20200129&groupid=1# - pretty much all those failures are 'performance related' - and all happened on OW4
The system journal on openqaworker4 goes back to 2019-11-20, but only since 2020-01-30 do we see NVMe timeouts:
sudo journalctl | grep 'nvme.*timeout' | less
Jan 30 06:22:18 openqaworker4 kernel: nvme nvme0: I/O 814 QID 22 timeout, aborting
Jan 30 06:22:18 openqaworker4 kernel: nvme nvme0: I/O 816 QID 22 timeout, aborting
Jan 30 06:22:18 openqaworker4 kernel: nvme nvme0: I/O 818 QID 22 timeout, aborting
…
Jan 31 09:40:17 openqaworker4 kernel: nvme nvme0: I/O 242 QID 1 timeout, aborting
Jan 31 09:40:17 openqaworker4 kernel: nvme nvme0: I/O 245 QID 1 timeout, aborting
Jan 31 09:40:17 openqaworker4 kernel: nvme nvme0: I/O 247 QID 1 timeout, aborting
so the problem persists across the nightly upgrade.
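To pin down when the timeouts started, the journal messages can be tallied per day; a minimal sketch, assuming the kernel-log lines keep the format shown above:
# count NVMe timeout messages per day; the journal is chronological, so uniq groups per date
sudo journalctl -k --since 2019-11-20 | grep 'nvme.*timeout' | awk '{print $1, $2}' | uniq -c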
Updated by okurz almost 5 years ago
As a potential short-term remedy I stopped some openQA worker instances on the machine: systemctl disable --now openqa-worker@{11..16}. Maybe other tests have a better chance of finishing when there is less load, but I actually doubt it helps.
Updated by okurz almost 5 years ago
Created https://infra.nue.suse.com/SelfService/Display.html?id=162199, shared with #59858.
Updated by okurz almost 5 years ago
- Status changed from New to Blocked
- Assignee set to okurz
I stopped all openQA worker instances on openqaworker4 as it seems it cannot successfully finish any test job. I have raised a ticket with SUSE Engineering Infrastructure and will also look into moving a machine from the OSD pool to o3.
Updated by mkittler almost 5 years ago
Yesterday I ran smartctl on nvme0, but it didn't show any errors, and I couldn't run self-tests: "SMART overall-health self-assessment test result: FAILED! - media has been placed in read only mode".
We also noticed that even a single touch can take several seconds, so the file system is seriously slow.
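A quick way to quantify that slowness, as a sketch (the pool path and probe filename are just examples):
# time a single metadata operation on the affected filesystem
cd /var/lib/openqa/pool
time touch poo62849_probe && time rm poo62849_probe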
Updated by okurz almost 5 years ago
- Related to action #62162: Move one openqa worker machine from osd to o3 added
Updated by okurz almost 5 years ago
Update in https://infra.nue.suse.com/SelfService/Display.html?id=162199: "I did something", which I assume was a manual check of the physical cable connections. I doubt this will help, but nevertheless I set up the partitions according to #19238, i.e. put all (both) NVMes into a RAID0 used as shared cache+pool, and set up openQA worker instances with a new worker class "qemu_x86_64_poo62849". However, already on mkfs.ext2 /dev/md2 we get in dmesg:
[Mon Feb 3 16:05:47 2020] nvme nvme0: I/O 660 QID 21 timeout, aborting
[Mon Feb 3 16:05:47 2020] nvme nvme0: I/O 662 QID 21 timeout, aborting
[Mon Feb 3 16:05:47 2020] nvme nvme0: I/O 664 QID 21 timeout, aborting
[Mon Feb 3 16:05:47 2020] nvme nvme0: I/O 666 QID 21 timeout, aborting
[Mon Feb 3 16:05:47 2020] nvme nvme0: Abort status: 0x0
[Mon Feb 3 16:05:47 2020] nvme nvme0: I/O 802 QID 21 timeout, aborting
[Mon Feb 3 16:05:52 2020] nvme nvme0: Abort status: 0x0
[Mon Feb 3 16:05:52 2020] nvme nvme0: Abort status: 0x0
[Mon Feb 3 16:05:52 2020] nvme nvme0: Abort status: 0x0
[Mon Feb 3 16:05:52 2020] nvme nvme0: Abort status: 0x0
so nvme0 can still be considered broken. But with the changed setup it is now easy to exclude nvme0 and just use nvme1 for the time being:
mdadm --manage /dev/md2 --fail /dev/nvme0n1 && mdadm --manage /dev/md2 --remove /dev/nvme0n1
However, first we needed to reboot to be able to unblock. Did that and re-set up the RAID. Also, the SMART data shows that the device is reported as broken:
# smartctl -a /dev/nvme1
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.12.14-lp151.28.36-default] (SUSE RPM)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: INTEL SSDPE2ME400G4
Serial Number: PHMD5486006J400FGN
Firmware Version: 8DV10171
PCI Vendor/Subsystem ID: 0x8086
IEEE OUI Identifier: 0x5cd2e4
Controller ID: 0
Number of Namespaces: 1
Namespace 1 Size/Capacity: 400,088,457,216 [400 GB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Mon Feb 3 16:25:18 2020 CET
Firmware Updates (0x02): 1 Slot
Optional Admin Commands (0x0006): Format Frmw_DL
Optional NVM Commands (0x0006): Wr_Unc DS_Mngmt
Maximum Data Transfer Size: 32 Pages
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 25.00W - - 0 0 0 0 0 0
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 2
1 - 512 8 2
2 - 512 16 2
3 - 4096 0 0
4 - 4096 8 0
5 - 4096 64 0
6 - 4096 128 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning: 0x00
Temperature: 18 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 1%
Data Units Read: 127,214,186 [65.1 TB]
Data Units Written: 138,042,960 [70.6 TB]
Host Read Commands: 826,913,364
Host Write Commands: 567,132,398
Controller Busy Time: 1,362
Power Cycles: 62
Power On Hours: 34,139
Unsafe Shutdowns: 48
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
openqaworker4:~ # smartctl -a /dev/nvme0
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.12.14-lp151.28.36-default] (SUSE RPM)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: INTEL SSDPE2ME400G4
Serial Number: PHMD5486006N400FGN
Firmware Version: 8DV10171
PCI Vendor/Subsystem ID: 0x8086
IEEE OUI Identifier: 0x5cd2e4
Controller ID: 0
Number of Namespaces: 1
Namespace 1 Size/Capacity: 400,088,457,216 [400 GB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Mon Feb 3 16:25:27 2020 CET
Firmware Updates (0x02): 1 Slot
Optional Admin Commands (0x0006): Format Frmw_DL
Optional NVM Commands (0x0006): Wr_Unc DS_Mngmt
Maximum Data Transfer Size: 32 Pages
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 25.00W - - 0 0 0 0 0 0
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 2
1 - 512 8 2
2 - 512 16 2
3 - 4096 0 0
4 - 4096 8 0
5 - 4096 64 0
6 - 4096 128 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- media has been placed in read only mode
SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning: 0x08
Temperature: 19 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 105%
Data Units Read: 638,301,999 [326 TB]
Data Units Written: 8,531,815,272 [4.36 PB]
Host Read Commands: 4,637,747,641
Host Write Commands: 40,016,781,281
Controller Busy Time: 206,712
Power Cycles: 60
Power On Hours: 34,148
Unsafe Shutdowns: 49
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
Scheduled some test jobs:
for i in {1..20}; do openqa-clone-job --within-instance https://openqa.opensuse.org 1162816 WORKER_CLASS=qemu_x86_64_poo62849 _GROUP=0 BUILD=poo62849 TEST=create_hdd_minimalx_poo62849_$i; done
Created job #1163550: opensuse-Tumbleweed-NET-x86_64-Build20200201-create_hdd_minimalx@64bit -> https://openqa.opensuse.org/t1163550
The build https://openqa.opensuse.org/tests/overview?distri=opensuse&build=poo62849&version=Tumbleweed shows 20/20 passed.
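For reference, such a dedicated worker class can be assigned to a few instances in /etc/openqa/workers.ini; a sketch, with the instance numbers only assumed here:
# /etc/openqa/workers.ini (excerpt)
[global]
WORKER_CLASS = qemu_x86_64

[1]
WORKER_CLASS = qemu_x86_64_poo62849

[2]
WORKER_CLASS = qemu_x86_64_poo62849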
The machine did not come up after reboot because there was a file /etc/fstab.sys which still listed the /var/lib/openqa/pool and /var/lib/openqa/cache mount points. I disabled the two lines in that file and did transactional-update initrd && reboot. The machine is now back up with only the second NVMe used.
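Disabling those stale entries can also be scripted; a sketch under the assumption that the two lines simply contain the pool and cache paths:
# comment out the stale mount entries, then rebuild the initrd as above
cp /etc/fstab.sys /etc/fstab.sys.bak
sed -i -e 's|^\([^#].*/var/lib/openqa/pool.*\)|# \1|' -e 's|^\([^#].*/var/lib/openqa/cache.*\)|# \1|' /etc/fstab.sys
transactional-update initrd && reboot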
Updated by okurz almost 5 years ago
- Subject changed from broken NVMe on openqaworker4 to broken NVMe on openqaworker4 auto_review:"No space left on device"
[05/02/2020 16:34:41] <DimStar> okurz[m]: any experiments happening with ow4?
[05/02/2020 16:35:36] <okurz> DimStar: not really, what's up?
[05/02/2020 16:35:45] <DimStar> okurz: https://openqa.opensuse.org/tests/overview?arch=&machine=&modules=&todo=1&distri=microos&distri=opensuse&version=Tumbleweed&build=20200204&groupid=1#
[05/02/2020 16:35:51] <DimStar> all incompletes are from OW4
[05/02/2020 16:40:11] <okurz> DimStar: not sure yet but I will shut off the openQA worker instances
[05/02/2020 16:41:39] <DimStar> okurz: k; seems it worked sort of ok for a day or two, but now starts to mess up again.. not sure what the ratio between failed and passed jobs there is currently on ow4
[05/02/2020 16:41:45] <okurz> DimStar: I see now, out of space :( Problem is that one NVMe is not enough for 16 worker instances, I guess if we run less it can work
[05/02/2020 16:42:47] <DimStar> let's run with 8? still better than missing the entire box
[05/02/2020 16:43:35] <okurz> yeah, I guess
[05/02/2020 16:44:06] <okurz> retriggered incompletes
[05/02/2020 16:44:10] <DimStar> thanks!
Incompletes on openqaworker4 again, e.g. https://openqa.opensuse.org/tests/1164661/file/autoinst-log.txt showing "No space left on device".
Disabled all 16 worker instances, restarted the incompletes with env worker=openqaworker4 ~/local/os-autoinst/scripts/openqa-restart-incompletes-on-worker-instance and will bring up 8 again.
EDIT: OK, it seems I was missing another thing instead: /etc/openqa/workers.ini stated "CACHELIMIT = 300", so we would fill up the disk with the cache, not the pool. Reduced it to the default of 50 and restarted the cacheservice and 8 worker instances.
EDIT: 2020-02-06: After the tests looked fine so far, I also enabled instances 9..12.
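For reference, the CACHELIMIT change from the first EDIT corresponds to this part of /etc/openqa/workers.ini (a sketch):
# /etc/openqa/workers.ini (excerpt)
[global]
# asset cache size limit; 300 filled up the single NVMe, 50 is the default
CACHELIMIT = 50
Afterwards the cacheservice is restarted, e.g. with systemctl restart openqa-worker-cacheservice openqa-worker-cacheservice-minion, before bringing the worker instances back up.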
Updated by okurz over 4 years ago
gschlotter informed me yesterday that he will take care of a "service request" for openqaworker4 in the next days.
Updated by okurz over 4 years ago
- Priority changed from Urgent to Normal
We brought openqaworker7 into the o3 infrastructure so that the reduced capacity of openqaworker4 is less of a problem.
Updated by okurz over 4 years ago
- Status changed from Blocked to Feedback
https://infra.nue.suse.com/SelfService/Display.html?id=162199 was resolved after openqaworker4 received a new, second NVMe. I can already see it in the system but would shift the update of the worker config to a time when fewer tests are running.
Updated by okurz over 4 years ago
- Due date set to 2020-03-10
# stop all worker instances and unmount the openQA storage
systemctl stop openqa-worker.target openqa-worker@\* && umount /var/lib/openqa/share && umount /var/lib/openqa
# stop the old array and re-create the RAID0 across both NVMes
mdadm --stop /dev/md/openqa
mdadm --create /dev/md/openqa --level=0 --force --raid-devices=2 --run /dev/nvme?n1
# persist the new array in the mdadm config
mdadm --detail --scan >> /etc/mdadm.conf
# delete duplicate entries
vim /etc/mdadm.conf
# create the filesystem, remount and re-create the worker directories
mkfs.ext2 /dev/md/openqa
mount /var/lib/openqa
mkdir -p /var/lib/openqa/{pool,cache,share} && /usr/bin/chown _openqa-worker:root /var/lib/openqa/{pool,cache,share}
mount -a
# bring the cache service and all 20 worker instances back up
systemctl unmask openqa-worker-cacheservice openqa-worker-cacheservice-minion openqa-worker.target openqa-worker@{1..20}
systemctl enable --now openqa-worker-cacheservice openqa-worker-cacheservice-minion openqa-worker.target openqa-worker@{1..20}
All up and running.
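A quick sanity check that both NVMes are back in the array and the worker instances are running, as a sketch:
cat /proc/mdstat
mdadm --detail /dev/md/openqa
df -h /var/lib/openqa
systemctl is-active openqa-worker@{1..20}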
Updated by okurz over 4 years ago
- Status changed from Feedback to Resolved
Realized that /etc/openqa/workers.ini was not configuring all instances correctly. Changed it now to use the generic WORKER_CLASS and only define special classes for instances 7 and 8, which were previously also handled in a special manner, probably for no good reason. Now instances 1..20 are up and show up fine on https://openqa.opensuse.org/admin/workers. I have checked the history of all worker instances and they all look ok.