action #103524
OW1: performance loss size:M (closed)
Added by dimstar about 3 years ago. Updated over 2 years ago.
% Done: 100%
Description
Observation
A lot of tests have been failing since Friday, even the rather 'simple' raid install tests. The failure very often happens in 'await_install' after the job has spent an hour installing RPMs; those tests normally get through the installer in 20 minutes.
This started happening at the same time across the product to be tested as well as in stagings, which makes product-related issues less probable than ow1-related issues.
Acceptance criteria
- AC1: openqaworker1 is back in production with a stable set of worker instances
Suggestions
- Reduce the number of openQA worker instances until jobs no longer commonly fail or incomplete (see the sketch after this list)
- If recurring I/O errors show up in the logs, consider replacing hardware again, or the other NVMe if that one is faulty
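For reference, a minimal sketch of how the instance count could be reduced, assuming the openqa-worker-auto-restart@ unit names used elsewhere in this ticket; the range to mask is only an example and needs to be chosen based on how many instances remain stable:
# stop and mask the upper worker instances so only a stable subset keeps running
systemctl mask --now openqa-worker-auto-restart@{13..20}
# check which instances are still active
systemctl list-units 'openqa-worker-auto-restart@*' --state=active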
Rollback actions
- Use a proper RAID of all NVMe devices for /var/lib/openqa again (favogt has set up an unusual direct use of /dev/nvme0n1 for /var/lib/openqa for now)
- Enable additional worker instances again after hardware replacement:
systemctl unmask --now openqa-worker-auto-restart@{16..20}
- Increase WORKERCACHE in openqaworker1:/etc/openqa/workers.ini to a higher value again, e.g. 400GB, after the NVMe replacement has increased the available space
- Increase the number of worker instances again, i.e. enable worker instances [7..16] (see the sketch after this list)
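A rough sketch of these rollback steps, assuming the WORKERCACHE setting name from this description and the usual openQA worker cache service units; verify both against the actual workers.ini and installed units before running anything:
# re-enable the additional worker instances after the NVMe replacement
systemctl unmask --now openqa-worker-auto-restart@{16..20}
systemctl enable --now openqa-worker-auto-restart@{7..16}
# raise WORKERCACHE in /etc/openqa/workers.ini back to e.g. 400 (GB), then restart the cache service
$EDITOR /etc/openqa/workers.ini
systemctl restart openqa-worker-cacheservice openqa-worker-cacheservice-minion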
Updated by favogt about 3 years ago
There are disk issues:
openqaworker1:/etc # smartctl -a /dev/nvme0n1
smartctl 7.0 2019-05-21 r4917 [x86_64-linux-5.3.18-59.34-default] (SUSE RPM)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: INTEL SSDPE2ME400G4
Serial Number: CVMD4352001W400FGN
Firmware Version: 8DV10151
PCI Vendor/Subsystem ID: 0x8086
IEEE OUI Identifier: 0x5cd2e4
Controller ID: 0
Number of Namespaces: 1
Namespace 1 Size/Capacity: 400.088.457.216 [400 GB]
Namespace 1 Formatted LBA Size: 4096
Local Time is: Mon Dec 6 10:02:32 2021 CET
Firmware Updates (0x02): 1 Slot
Optional Admin Commands (0x0006): Format Frmw_DL
Optional NVM Commands (0x0006): Wr_Unc DS_Mngmt
Maximum Data Transfer Size: 32 Pages
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 25.00W - - 0 0 0 0 0 0
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 - 512 0 2
1 - 512 8 2
2 - 512 16 2
3 + 4096 0 0
4 - 4096 8 0
5 - 4096 64 0
6 - 4096 128 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 24 Celsius
Available Spare: 99%
Available Spare Threshold: 10%
Percentage Used: 32%
Data Units Read: 1.123.143.620 [575 TB]
Data Units Written: 2.491.413.126 [1,27 PB]
Host Read Commands: 7.436.632.694
Host Write Commands: 12.539.489.915
Controller Busy Time: 10.331
Power Cycles: 101
Power On Hours: 55.231
Unsafe Shutdowns: 21
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Error Information (NVMe Log 0x01, max 64 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 1 13 0x0000 0x4502 0x000 0 1 -
openqaworker1:/etc # smartctl -a /dev/nvme1n1
smartctl 7.0 2019-05-21 r4917 [x86_64-linux-5.3.18-59.34-default] (SUSE RPM)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: INTEL SSDPEKNW010T8
Serial Number: BTNH022600SN1P0B
Firmware Version: 002C
PCI Vendor/Subsystem ID: 0x8086
IEEE OUI Identifier: 0x5cd2e4
Controller ID: 1
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1.024.209.543.168 [1,02 TB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Mon Dec 6 10:02:34 2021 CET
Firmware Updates (0x14): 2 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size: 32 Pages
Warning Comp. Temp. Threshold: 77 Celsius
Critical Comp. Temp. Threshold: 80 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 4.00W - - 0 0 0 0 0 0
1 + 3.00W - - 1 1 1 1 0 0
2 + 2.20W - - 2 2 2 2 0 0
3 - 0.0300W - - 3 3 3 3 5000 5000
4 - 0.0040W - - 4 4 4 4 5000 9000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x04
Temperature: 32 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 118%
Data Units Read: 960.439.256 [491 TB]
Data Units Written: 1.534.101.586 [785 TB]
Host Read Commands: 6.131.723.556
Host Write Commands: 6.981.165.126
Controller Busy Time: 322.477
Power Cycles: 5
Power On Hours: 10.708
Unsafe Shutdowns: 4
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Error Information (NVMe Log 0x01, max 256 entries)
No Errors Logged
openqaworker1:/etc # cat /proc/mdstat
Personalities : [raid1] [raid0]
md0 : active raid0 sdb2[0] sda2[1]
2115520 blocks super 1.0 32k chunks
md127 : active raid0 nvme0n1[0] nvme1n1[1]
1390651392 blocks super 1.2 512k chunks
md1 : active raid1 sdb3[0] sda3[1]
975701824 blocks super 1.0 [2/2] [UU]
bitmap: 0/8 pages [0KB], 65536KB chunk
unused devices: <none>
openqaworker1:/etc # dmesg | grep md127
[ 27.406823] md127: detected capacity change from 0 to 1424027025408
[ 29.791083] EXT4-fs (md127): mounting ext2 file system using the ext4 subsystem
[ 29.793829] EXT4-fs (md127): warning: mounting unchecked fs, running e2fsck is recommended
[ 29.800001] EXT4-fs (md127): mounted filesystem without journal. Opts: (null)
[ 92.009666] EXT4-fs error (device md127): ext4_mb_generate_buddy:747: group 3516, block bitmap and bg descriptor inconsistent: 17811 vs 17814 free clusters
[ 92.719845] EXT4-fs error (device md127): ext4_mb_generate_buddy:747: group 4226, block bitmap and bg descriptor inconsistent: 25284 vs 25287 free clusters
[ 110.632664] EXT4-fs error (device md127): ext4_mb_generate_buddy:747: group 3518, block bitmap and bg descriptor inconsistent: 26031 vs 25834 free clusters
[ 332.995178] EXT4-fs (md127): error count since last fsck: 2825
[ 332.995181] EXT4-fs (md127): initial error at time 1629432432: mb_free_blocks:1457: inode 28803076: block 115238455
[ 332.995184] EXT4-fs (md127): last error at time 1638758894: ext4_mb_generate_buddy:747
[ 4355.696038] EXT4-fs error (device md127): ext4_mb_generate_buddy:747: group 4241, block bitmap and bg descriptor inconsistent: 23039 vs 23042 free clusters
[ 4545.295790] EXT4-fs error (device md127): ext4_mb_generate_buddy:747: group 3194, block bitmap and bg descriptor inconsistent: 25197 vs 25217 free clusters
[ 4607.626626] EXT4-fs error (device md127): ext4_mb_generate_buddy:747: group 3236, block bitmap and bg descriptor inconsistent: 27817 vs 27912 free clusters
[ 5010.953172] EXT4-fs error (device md127): ext4_mb_generate_buddy:747: group 3485, block bitmap and bg descriptor inconsistent: 16629 vs 16641 free clusters
[ 5022.249988] EXT4-fs error (device md127): ext4_mb_generate_buddy:747: group 3537, block bitmap and bg descriptor inconsistent: 13818 vs 13820 free clusters
[ 5439.212789] EXT4-fs error (device md127): ext4_mb_generate_buddy:747: group 3833, block bitmap and bg descriptor inconsistent: 28007 vs 28005 free clusters
[ 5835.279080] EXT4-fs error (device md127): ext4_mb_generate_buddy:747: group 4141, block bitmap and bg descriptor inconsistent: 32254 vs 32287 free clusters
[ 7515.865532] EXT4-fs error (device md127): ext4_mb_generate_buddy:747: group 5267, block bitmap and bg descriptor inconsistent: 12442 vs 12454 free clusters
[11877.153446] EXT4-fs error (device md127): ext4_mb_generate_buddy:747: group 7195, block bitmap and bg descriptor inconsistent: 5553 vs 5666 free clusters
[11877.191196] EXT4-fs error (device md127): ext4_mb_generate_buddy:747: group 7196, block bitmap and bg descriptor inconsistent: 16234 vs 18561 free clusters
[11877.225669] EXT4-fs error (device md127): ext4_mb_generate_buddy:747: group 7197, block bitmap and bg descriptor inconsistent: 18454 vs 18695 free clusters
openqaworker1:/etc # dmesg | grep md1
[ 12.819165] md/raid1:md1: active with 2 out of 2 mirrors
[ 12.852649] md1: detected capacity change from 0 to 999118667776
[ 13.798414] BTRFS: device fsid ff1922d2-d2e4-4860-9634-acac681dd0f9 devid 1 transid 2287684 /dev/md1
[ 13.929720] BTRFS info (device md1): disk space caching is enabled
[ 13.929721] BTRFS info (device md1): has skinny extents
[ 14.475616] BTRFS info (device md1): disk space caching is enabled
[ 18.952912] BTRFS info (device md1): disk space caching is enabled
[ 27.406823] md127: detected capacity change from 0 to 1424027025408
[ 29.520197] BTRFS info (device md1): disk space caching is enabled
[ 29.791083] EXT4-fs (md127): mounting ext2 file system using the ext4 subsystem
[ 29.793829] EXT4-fs (md127): warning: mounting unchecked fs, running e2fsck is recommended
[ 29.800001] EXT4-fs (md127): mounted filesystem without journal. Opts: (null)
[ 92.009666] EXT4-fs error (device md127): ext4_mb_generate_buddy:747: group 3516, block bitmap and bg descriptor inconsistent: 17811 vs 17814 free clusters
[ 92.719845] EXT4-fs error (device md127): ext4_mb_generate_buddy:747: group 4226, block bitmap and bg descriptor inconsistent: 25284 vs 25287 free clusters
[ 110.632664] EXT4-fs error (device md127): ext4_mb_generate_buddy:747: group 3518, block bitmap and bg descriptor inconsistent: 26031 vs 25834 free clusters
[ 332.995178] EXT4-fs (md127): error count since last fsck: 2825
[ 332.995181] EXT4-fs (md127): initial error at time 1629432432: mb_free_blocks:1457: inode 28803076: block 115238455
[ 332.995184] EXT4-fs (md127): last error at time 1638758894: ext4_mb_generate_buddy:747
[ 4355.696038] EXT4-fs error (device md127): ext4_mb_generate_buddy:747: group 4241, block bitmap and bg descriptor inconsistent: 23039 vs 23042 free clusters
[ 4545.295790] EXT4-fs error (device md127): ext4_mb_generate_buddy:747: group 3194, block bitmap and bg descriptor inconsistent: 25197 vs 25217 free clusters
[ 4607.626626] EXT4-fs error (device md127): ext4_mb_generate_buddy:747: group 3236, block bitmap and bg descriptor inconsistent: 27817 vs 27912 free clusters
[ 5010.953172] EXT4-fs error (device md127): ext4_mb_generate_buddy:747: group 3485, block bitmap and bg descriptor inconsistent: 16629 vs 16641 free clusters
[ 5022.249988] EXT4-fs error (device md127): ext4_mb_generate_buddy:747: group 3537, block bitmap and bg descriptor inconsistent: 13818 vs 13820 free clusters
[ 5439.212789] EXT4-fs error (device md127): ext4_mb_generate_buddy:747: group 3833, block bitmap and bg descriptor inconsistent: 28007 vs 28005 free clusters
[ 5835.279080] EXT4-fs error (device md127): ext4_mb_generate_buddy:747: group 4141, block bitmap and bg descriptor inconsistent: 32254 vs 32287 free clusters
[ 7515.865532] EXT4-fs error (device md127): ext4_mb_generate_buddy:747: group 5267, block bitmap and bg descriptor inconsistent: 12442 vs 12454 free clusters
[11877.153446] EXT4-fs error (device md127): ext4_mb_generate_buddy:747: group 7195, block bitmap and bg descriptor inconsistent: 5553 vs 5666 free clusters
[11877.191196] EXT4-fs error (device md127): ext4_mb_generate_buddy:747: group 7196, block bitmap and bg descriptor inconsistent: 16234 vs 18561 free clusters
[11877.225669] EXT4-fs error (device md127): ext4_mb_generate_buddy:747: group 7197, block bitmap and bg descriptor inconsistent: 18454 vs 18695 free clusters
While QEMU was running, the system load exceeded 300 and sync
never returned. After stopping QEMU, this situation got much better.
Simply recreating the ext2 as ext4 should help against the errors, but the IO issues are probably caused by the block device itself.
Updated by favogt about 3 years ago
Not the right ticket for this, but it's close enough:
I recovered openqaworker4 from a btrfs error on /. The error caused / to mount only read-only and every boot ended up in the emergency shell.
btrfs check /dev/md1 reported errors:
Opening filesystem to check...
Checking filesystem on /dev/md1
UUID: 9058d83e-9f1f-49be-a4c0-8f975248ad39
[1/7] checking root items
[2/7] checking extents
ref mismatch on [1054179328 16384] extent item 0, found 1
tree backref 1054179328 parent 15764 root 15764 not found in extent tree
backpointer mismatch on [1054179328 16384]
ref mismatch on [1078231040 16384] extent item 0, found 1
tree backref 1078231040 parent 15764 root 15764 not found in extent tree
backpointer mismatch on [1078231040 16384]
ref mismatch on [1090617344 16384] extent item 0, found 1
tree backref 1090617344 parent 15764 root 15764 not found in extent tree
backpointer mismatch on [1090617344 16384]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups
ERROR: out of memory
ERROR: Loading qgroups from disk: -2
ERROR: failed to check quota groups
found 22590820352 bytes used, error(s) found
total csum bytes: 14809236
total tree bytes: 233046016
total fs tree bytes: 198819840
total extent tree bytes: 16171008
btree space waste bytes: 44986480
file data blocks allocated: 1505488662528
referenced 28240560128
extent buffer leak: start 249446400 len 16384
extent buffer leak: start 257523712 len 16384
To repair that, it's necessary to have /dev/md1 not mounted. For that, I first tried to reboot and get into the GRUB menu, but that was somehow disabled, both on the serial console (IPMI) as well as the graphical one (Java console).
So I had to use kexec in the emergency shell:
kexec --initrd=/boot/initrd --append="console=tty0 console=ttyS1,115200n rd.break" /boot/vmlinuz
Inside, I mounted /dev/md127 to store a btrfs metadata backup:
sh-4.4# mdadm --assemble --scan
[ 174.399341] md: md0 stopped.
[ 174.409997] md0: detected capacity change from 0 to 2153709568
mdadm: /dev/md/0 has been started with 2 drives.
[ 174.594991] md: md127 stopped.
[ 174.599073] nvme1n1: p1
[ 174.604981] md127: detected capacity change from 0 to 799906201600
mdadm: /dev/md/o[ 174.612954] nvme1n1: p1
penqa has been started with 2 drives.
sh-4.4#
sh-4.4#
sh-4.4# cat /proc/mdstat
Personalities : [raid1] [raid0]
md127 : active raid0 nvme1n1[0] nvme0n1[1]
781158400 blocks super 1.2 512k chunks
md0 : active raid0 sda1[0] sdb1[1]
2103232 blocks super 1.0 32k chunks
md1 : active raid1 sdb2[1] sda2[0]
975708992 blocks super 1.0 [2/2] [UU]
bitmap: 0/8 pages [0KB], 65536KB chunk
unused devices: <none>
sh-4.4# mount /dev/md127 /mnt
mount: /mnt: mount point does not exist.
sh-4.4# mkdir /mnt
sh-4.4# mount /dev/md127 /mnt
mount: /mnt: unknown filesystem type 'ext2'.
sh-4.4#
sh-4.4# mount /dev/md1 -o ro /sysroot
[ 210.003355] BTRFS info (device md1): disk space caching is enabled
[ 210.009570] BTRFS info (device md1): has skinny extents
... (omitting some dead-ends and mistypes)
sh-4.4# chroot /sysroot modprobe ext4
sh-4.4# mount /dev/md127 /mnt
sh-4.4# cp /sysroot/usr/sbin/btrfs-image /tmp
sh-4.4# umount /sysroot
sh-4.4# /tmp/btrfs-image -w /dev/md1 /mnt/btrfs-image-2
sh-4.4# ls -lh /mnt/btrfs-image*
-rw-r--r-- 1 root root 1.3G Dec 6 09:00 /mnt/btrfs-image
-rw-r--r-- 1 root root 1.3G Dec 6 09:15 /mnt/btrfs-image-2
And finally ran btrfs check --repair:
sh-4.4# btrfs check --repair /dev/md1
enabling repair mode
Opening filesystem to check...
Checking filesystem on /dev/md1
UUID: 9058d83e-9f1f-49be-a4c0-8f975248ad39
[1/7] checking root items
Fixed 0 roots.
[2/7] checking extents
ref mismatch on [1054179328 16384] extent item 0, found 1
tree backref 1054179328 parent 15764 root 15764 not found in extent tree
backpointer mismatch on [1054179328 16384]
adding new tree backref on start 1054179328 len 16384 parent 0 root 15764
Repaired extent references for 1054179328
ref mismatch on [1078231040 16384] extent item 0, found 1
tree backref 1078231040 parent 15764 root 15764 not found in extent tree
backpointer mismatch on [1078231040 16384]
adding new tree backref on start 1078231040 len 16384 parent 0 root 15764
Repaired extent references for 1078231040
ref mismatch on [1090617344 16384] extent item 0, found 1
tree backref 1090617344 parent 15764 root 15764 not found in extent tree
backpointer mismatch on [1090617344 16384]
adding new tree backref on start 1090617344 len 16384 parent 0 root 15764
Repaired extent references for 1090617344
No device size related problem found
[3/7] checking free space cache
cache and super generation don't match, space cache will be invalidated
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups
ERROR: out of memory
ERROR: Loading qgroups from disk: -2
ERROR: failed to check quota groups
found 22590820352 bytes used, error(s) found
total csum bytes: 14809236
total tree bytes: 233046016
total fs tree bytes: 198819840
total extent tree bytes: 16171008
btree space waste bytes: 44986480
file data blocks allocated: 1505488662528
referenced 28240560128
extent buffer leak: start 163807232 len 16384
extent buffer leak: start 257523712 len 16384
After this, btrfs check was mostly happy:
sh-4.4# btrfs chec k/dev/md1
Opening filesystem to check...
ERROR: mount check: cannot open k/dev/md1: No such file or directory
ERROR: could not check mount status: No such file or directory
sh-4.4# btrfs check /dev/md1
Opening filesystem to check...
Checking filesystem on /dev/md1
UUID: 9058d83e-9f1f-49be-a4c0-8f975248ad39
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
cache and super generation don't match, space cache will be invalidated
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups
ERROR: out of memory
ERROR: Loading qgroups from disk: -2
ERROR: failed to check quota groups
Finally I did a qgroup rescan (mount /dev/md1 /sysroot; btrfs quota rescan -w /sysroot) and rebooted. The system came up fine.
I changed the GRUB settings back to the default, in the hope that this fixes the missing menu.
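For completeness, the qgroup rescan from the paragraph above spelled out step by step (a sketch; the umount before the reboot is an assumed tidy-up step, and -w makes the rescan block until it has finished):
mount /dev/md1 /sysroot
btrfs quota rescan -w /sysroot
umount /sysroot
reboot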
Updated by favogt about 3 years ago
- Status changed from New to In Progress
- Assignee set to favogt
- Priority changed from Normal to Urgent
Back to the original topic, openqaworker1.
The NVMe health doesn't look good, so what might help is a "factory reset" of both devices, i.e. a full erase cycle.
For that I entered a transactional-update shell to install nvme-cli and ran:
transactional update # mdadm --stop /dev/md127
mdadm: stopped /dev/md127
transactional update # nvme list
Node SN Model Namespace Usage Format FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 CVMD4352001W400FGN INTEL SSDPE2ME400G4 1 400.09 GB / 400.09 GB 4 KiB + 0 B 8DV10151
/dev/nvme1n1 BTNH022600SN1P0B INTEL SSDPEKNW010T8 1 1.02 TB / 1.02 TB 512 B + 0 B 002C
transactional update # nvme id-ctrl -H /dev/nvme0 | grep Format
[1:1] : 0x1 Format NVM Supported
[0:0] : 0 Admin Vendor Specific Commands uses Vendor Specific Format
[0:0] : 0x1 Format Applies to All Namespace(s)
[0:0] : 0 NVM Vendor Specific Commands uses Vendor Specific Format
transactional update # nvme id-ctrl -H /dev/nvme1 | grep Format
[1:1] : 0x1 Format NVM Supported
[0:0] : 0 Admin Vendor Specific Commands uses Vendor Specific Format
[0:0] : 0 Format Applies to Single Namespace(s)
[0:0] : 0 NVM Vendor Specific Commands uses Vendor Specific Format
transactional update # nvme format -r --ses=1 /dev/nvme0
You are about to format nvme0, namespace 0xffffffff(ALL namespaces).
Controller nvme0 has child namespace(s):nvme0n1
WARNING: Format may irrevocably delete this device's data.
You have 10 seconds to press Ctrl-C to cancel this operation.
Use the force [--force|-f] option to suppress this warning.
Sending format operation ...
Success formatting namespace:ffffffff
transactional update # nvme format -r --ses=1 /dev/nvme1
get-namespace-id: Inappropriate ioctl for device
transactional update # nvme format -r --ses=1 -n 1 /dev/nvme1
You are about to format nvme1, namespace 0x1.
Controller nvme1 has child namespace(s):nvme1n1
WARNING: Format may irrevocably delete this device's data.
You have 10 seconds to press Ctrl-C to cancel this operation.
Use the force [--force|-f] option to suppress this warning.
Sending format operation ...
Success formatting namespace:1
Then I remade the RAID and FS on top:
openqaworker1:~ # mdadm --create /dev/md/openqa --level=0 --raid-devices=2 --run /dev/nvme?n1
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md/openqa started
openqaworker1:~ # mkfs.ext4 /dev/md/openqa
mke2fs 1.43.8 (1-Jan-2018)
Discarding device blocks: done
Creating filesystem with 347662848 4k blocks and 86917120 inodes
Filesystem UUID: 4dec5b35-616d-420e-93e4-87392ab5d940
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848
Allocating group tables: done
Writing inode tables: done
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done
(I had too many tabs open and almost ran that on ow4, whoops. Thankfully mdadm --create couldn't do anything because the devices were busy.)
In the t-u shell I edited /etc/fstab to use ext4 for /dev/md/openqa instead of ext2, exited and finally rebooted. It's back up and seems fine so far, but needs to resync the deleted data again. Let's see.
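For illustration, the updated entry presumably ended up looking like the line below; this matches the fstab diff shown further down in this ticket:
/dev/md/openqa /var/lib/openqa ext4 defaults 0 0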
Updated by maritawerner about 3 years ago
- Project changed from openQA Tests (public) to openQA Infrastructure (public)
- Category deleted (Infrastructure)
Updated by okurz about 3 years ago
- Target version set to future
Right now there is a single job running; all others are waiting for assets to download, which is still ongoing.
Updated by favogt about 3 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
ow1 failed to come up because of https://bugzilla.opensuse.org/show_bug.cgi?id=1183856 STILL not being fixed on 15.3. I had to make the btrfs snapshot read-write, edit /etc/fstab again, make it read-only and reboot. Should be fine now.
Updated by dimstar about 3 years ago
- Status changed from Resolved to Workable
The worker misbehaves again; Fabian investigated and it really does look like one of the NVMes needs to be replaced.
Updated by okurz about 3 years ago
- Target version changed from future to Ready
As next steps I suggest to exclude the faulty NVMe (INTEL SSDPEKNW010T8, Serial Number: BTNH022600SN1P0B) from the /var/lib/openqa partition and run with the remaining NVMe, which should still be enough as a temporary solution. Then we organize a replacement, e.g. create an EngInfra ticket, with an invoice forwarded to SUSE LSG QE (if it's not covered by warranty; AFAIR the second NVMe isn't too old).
@favogt feel free to do that or assign to someone from SUSE QE Tools to follow up, e.g. nsinger or me
Updated by favogt about 3 years ago
- Status changed from Workable to Resolved
hdparm -tT /dev/md1 never returned, so the IO issues were back. With dd if=/dev/nvmeXn1 of=/dev/null bs=10M count=10000 I was able to get some performance estimates though: ~1.7GiB/s for nvme0n1, 12-42MiB/s for nvme1n1.
Unfortunately it was bad enough that sync never returned, so a clean shutdown was impossible. In fact, not even reboot -f did anything, the system just continued running... I had to force a reset through the BMC (after running sync -f /).
okurz wrote:
As next steps I suggest to exclude the faulty NVMe (INTEL SSDPEKNW010T8, Serial Number: BTNH022600SN1P0B) from the /var/lib/openqa partition and run with the remaining NVMe, which should still be enough as a temporary solution. Then we organize a replacement, e.g. create an EngInfra ticket, with an invoice forwarded to SUSE LSG QE (if it's not covered by warranty; AFAIR the second NVMe isn't too old).
@favogt feel free to do that or assign to someone from SUSE QE Tools to follow up, e.g. nsinger or me
Yep, already did that. mdadm complains when trying to set up a RAID (0 or 1) with just a single disk, so now it's just using /dev/nvme0n1 as ext4 directly.
I had to edit the /etc/fstab of the snapshot itself again for that, meh. It's up and running again, let's see.
Updated by okurz about 3 years ago
- Status changed from Resolved to Feedback
favogt wrote:
okurz wrote:
As next steps I suggest to exclude the faulty NVMe (INTEL SSDPEKNW010T8, Serial Number: BTNH022600SN1P0B) from the /var/lib/openqa partition and run with the remaining NVMe, which should still be enough as a temporary solution. Then we organize a replacement, e.g. create an EngInfra ticket, with an invoice forwarded to SUSE LSG QE (if it's not covered by warranty; AFAIR the second NVMe isn't too old).
@favogt feel free to do that or assign to someone from SUSE QE Tools to follow up, e.g. nsinger or me
Yep, already did that. mdadm complains when trying to set up a RAID (0 or 1) with just a single disk, so now it's just using /dev/nvme0n1 as ext4 directly.
I had to edit the /etc/fstab of the snapshot itself again for that, meh. It's up and running again, let's see.
We commonly create a RAID0 with a single device anyway, see https://github.com/os-autoinst/salt-states-openqa/blob/master/openqa/nvme_store/openqa-establish-nvme-setup.sh#L33
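For reference, a sketch of how such a single-device RAID0 could be created, assuming /dev/nvme0n1 as the sole member (mdadm needs --force to accept a one-device array):
mdadm --create /dev/md/openqa --level=0 --force --raid-devices=1 --run /dev/nvme0n1
mkfs.ext4 /dev/md/openqa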
You set the ticket to "Resolved". I assume that was by mistake, because as long as we have one broken device in the machine it's not really fixed, right? So setting to "Feedback" instead for now.
Updated by maritawerner about 3 years ago
@okurz to my knowledge QE LSG has bought most of the openSUSE HW, so I think that the QE Tools team should have a look into that. Or do I miss something here? Check warranty and so on.
@mgriessmeier, is that HW that we bought?
Updated by livdywan about 3 years ago
- Assignee changed from favogt to nicksinger
- Priority changed from Urgent to High
Assigning to Nick to order the new NVMe drive, after discussing it briefly in the weekly
Updated by okurz about 3 years ago
- Status changed from Feedback to In Progress
maritawerner wrote:
@okurz to my knowledge QE LSG has bought most of the openSUSE HW, so I think that the QE Tools team should have a look into that. Or do I miss something here? Check warranty and so on.
Yes, that is correct. The ticket is already on our backlog (Target version "Ready"); favogt did the initial measurements and workarounds to run off one NVMe and we tracked that. With that workaround in place, nicksinger now takes over to order a replacement and then delegates to EngInfra to have it swapped within the machine.
Updated by livdywan about 3 years ago
- Related to action #103581: Many jobs on openqa.opensuse.org incomplete in ' timeout: setup exceeded MAX_SETUP_TIME' added
Updated by openqa_review about 3 years ago
- Due date set to 2021-12-25
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mgriessmeier almost 3 years ago
maritawerner wrote:
@okurz to my knowledge QE LSG has bought most of the openSUSE HW, so I think that the QE Tools team should have a look into that. Or do I miss something here? Check warranty and so on.
@mgriessmeier, is that HW that we bought?
not 100% sure, but I assume yes.
Nick is creating the jira-sd ticket right now and I will initiate the ordering.
Updated by nicksinger almost 3 years ago
- Status changed from In Progress to Feedback
Requested replacement ordering in https://sd.suse.com/servicedesk/customer/portal/1/SD-70873
Updated by okurz almost 3 years ago
- Due date changed from 2021-12-25 to 2022-01-28
Was the order actually followed up? Because https://sd.suse.com/servicedesk/customer/portal/1/SD-70873 looks like Jana Fiserova only closed the ticket after "guidance was provided"
Updated by nicksinger almost 3 years ago
I will remind Matthias once again to order the drive
Updated by okurz almost 3 years ago
- Description updated (diff)
After problems reported by dimstar in https://suse.slack.com/archives/C02CANHLANP/p1640344725011900 I decreased the worker cache size on openqaworker1 to 100GB and disabled openQA worker instances 16..20. Please follow the "rollback actions" defined in the ticket description after the hardware replacement.
Updated by okurz almost 3 years ago
nicksinger informed us in a private conversation that the replacement is at the post office in NBG; mgriessmeier needs to pick it up and hand it to SUSE-IT (duh) and then they can continue.
Updated by mgriessmeier almost 3 years ago
okurz wrote:
nicksinger informed us in a private conversation that the replacement is at the post office in NBG; mgriessmeier needs to pick it up and hand it to SUSE-IT (duh) and then they can continue.
Hi,
sorry, the whole delivery mess was my fault, as I was not able to pick it up in time. Meanwhile it was sent back to the vendor and I re-issued delivery to my home address; it is expected to be delivered today.
On Monday I will be in the office and hand it over to EngInfra. Sorry again for the delay!
cheers,
Matthias
Updated by apappas almost 3 years ago
I am not aware of the current status of the machine and the new SSD, so sorry if I am complaining both too late and prematurely, but I was asked to do so.
As part of the statistical investigation requested for a VR, I could force openqaworker1 to fail or incomplete jobs very consistently.
Here are the results of my VR:
https://openqa.opensuse.org/tests/overview?result=failed&result=incomplete&arch=&flavor=&machine=&test=&modules=&module_re=&distri=opensuse&version=15.3&build=vr_13808#
As you can see, openqaworker1 takes more than 3 hours to bring a Leap 15.3 system from GM to latest and sometimes just crashes. This does not happen with the other workers.
This has been happening since before Christmas and has been blocking my PR.
Updated by okurz almost 3 years ago
- Description updated (diff)
Thanks for your update. mkittler has stopped and disabled all workers but 1 to 6 and the containerized s390x workers to reduce the load.
Updated by nicksinger almost 3 years ago
- Status changed from Feedback to Blocked
We've placed the new disk for Infra and I created https://sd.suse.com/servicedesk/customer/portal/1/SD-75112 to ask for help with installing it.
Updated by nicksinger almost 3 years ago
- Due date changed from 2022-01-28 to 2022-02-04
Updated by livdywan almost 3 years ago
- Due date changed from 2022-02-04 to 2022-02-11
Bumping the due date due to "Will be done on next office day" on https://sd.suse.com/servicedesk/customer/portal/1/SD-75112
Updated by favogt almost 3 years ago
- Status changed from Blocked to Workable
- Assignee changed from nicksinger to favogt
- % Done changed from 100 to 70
Looks like the NVMe got installed \o/
openqaworker1:~ # smartctl -a /dev/nvme1n1
smartctl 7.0 2019-05-21 r4917 [x86_64-linux-5.3.18-150300.59.46-default] (SUSE RPM)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: INTEL SSDPEKNW010T8
Serial Number: BTNH117605HA1P0B
Firmware Version: 004C
PCI Vendor/Subsystem ID: 0x8086
IEEE OUI Identifier: 0x5cd2e4
Controller ID: 1
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1.024.209.543.168 [1,02 TB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Wed Feb 9 12:18:42 2022 CET
Firmware Updates (0x14): 2 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size: 32 Pages
Warning Comp. Temp. Threshold: 77 Celsius
Critical Comp. Temp. Threshold: 80 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 4.00W - - 0 0 0 0 0 0
1 + 3.00W - - 1 1 1 1 0 0
2 + 2.20W - - 2 2 2 2 0 0
3 - 0.0300W - - 3 3 3 3 5000 5000
4 - 0.0040W - - 4 4 4 4 5000 9000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 33 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 9 [4,60 MB]
Data Units Written: 0
Host Read Commands: 216
Host Write Commands: 0
Controller Busy Time: 0
Power Cycles: 1
Power On Hours: 28
Unsafe Shutdowns: 0
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Error Information (NVMe Log 0x01, max 256 entries)
No Errors Logged
The system is currently idle, so I stopped the workers on it; I will recreate the RAID after lunch and increase the worker count again afterwards.
Updated by favogt almost 3 years ago
- Status changed from Workable to Resolved
- % Done changed from 70 to 100
openqaworker1:~ # mdadm --create /dev/md/openqa --level=0 --raid-devices=2 --run /dev/nvme?n1
mdadm: /dev/nvme0n1 appears to contain an ext2fs file system
size=390711384K mtime=Wed Feb 9 03:33:46 2022
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md/openqa started.
openqaworker1:~ # mdadm --stop /dev/md/openqa
mdadm: stopped /dev/md/openqa
openqaworker1:~ # wipefs -a /dev/nvme0n1
/dev/nvme0n1: 4 bytes were erased at offset 0x00001000 (linux_raid_member): fc 4e 2b a9
/dev/nvme0n1: 2 bytes were erased at offset 0x00000438 (ext4): 53 ef
openqaworker1:~ # mdadm --create /dev/md/openqa --level=0 --raid-devices=2 --run /dev/nvme?n1
mdadm: /dev/nvme1n1 appears to be part of a raid array:
level=raid0 devices=2 ctime=Wed Feb 9 16:21:24 2022
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md/openqa started.
Then I set up the mount in a transactional-update shell:
openqaworker1:~ # transactional-update shell
Checking for newer version.
transactional-update 3.2.2 started
Options: shell
Separate /var detected.
2022-02-09 16:23:08 tukit 3.2.2 started
2022-02-09 16:23:08 Options: -c1221 open
2022-02-09 16:23:09 Using snapshot 1221 as base for new snapshot 1222.
2022-02-09 16:23:09 Syncing /etc of previous snapshot 1220 as base into new snapshot /.snapshots/1222/snapshot
ID: 1222
2022-02-09 16:23:12 Transaction completed.
Opening chroot in snapshot 1222, continue with 'exit'
2022-02-09 16:23:12 tukit 3.2.2 started
2022-02-09 16:23:12 Options: call 1222 bash
2022-02-09 16:23:12 Executing `bash`:
transactional update # mkfs.ext4 /dev/md/openqa
mke2fs 1.43.8 (1-Jan-2018)
Discarding device blocks: done
Creating filesystem with 347662848 4k blocks and 86917120 inodes
Filesystem UUID: f8df103a-c2da-4f5e-b1aa-2e9064bcbe6a
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848
Allocating group tables: done
Writing inode tables: done
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done
transactional update # nano /etc/fstab
transactional update # mount /var/lib/openqa
transactional update # umount /var/lib/openqa
transactional update # exit
2022-02-09 16:26:03 Application returned with exit status 0.
To work around the /etc/fstab handling bug in transactional-update, I copied the fstab from the resulting snapshot overlay into the snapshot itself:
openqaworker1:~ # btrfs prop set /.snapshots/1222/snapshot ro false
openqaworker1:~ # diff -u /var/lib/overlay/1222/etc/fstab /.snapshots/1222/snapshot/etc/fstab
--- /var/lib/overlay/1222/etc/fstab 2022-02-09 16:24:39.869538443 +0100
+++ /.snapshots/1222/snapshot/etc/fstab 2022-02-09 16:23:10.448519218 +0100
@@ -1,6 +1,6 @@
UUID=ff1922d2-d2e4-4860-9634-acac681dd0f9 / btrfs ro 0 0
UUID=ff1922d2-d2e4-4860-9634-acac681dd0f9 /var btrfs subvol=/@/var,x-initrd.mount 0 0
-/dev/md/openqa /var/lib/openqa ext4 defaults 0 0
+/dev/nvme0n1 /var/lib/openqa ext4 defaults 0 0
UUID=ff1922d2-d2e4-4860-9634-acac681dd0f9 /usr/local btrfs subvol=/@/usr/local 0 0
UUID=ff1922d2-d2e4-4860-9634-acac681dd0f9 /tmp btrfs subvol=/@/tmp 0 0
UUID=ff1922d2-d2e4-4860-9634-acac681dd0f9 /srv btrfs subvol=/@/srv 0 0
openqaworker1:~ # cp /var/lib/overlay/1222/etc/fstab /.snapshots/1222/snapshot/etc/fstab
openqaworker1:~ # btrfs prop set /.snapshots/1222/snapshot ro true
After a reboot it came up properly, and I unmasked all openqa-worker-auto-restart@ services and enabled workers 1-12 again.
Let's see how it goes!
Updated by okurz almost 3 years ago
Thank you. I double checked and all looks good from my side as well.
Updated by favogt almost 3 years ago
How many and which workers did ow1 have configured originally? IIRC it went up to 20, but there might've been a gap.
1-12 are configured to not have qemu_i586,vagrant currently.
Updated by okurz almost 3 years ago
favogt wrote:
How many and which workers did ow1 have configured originally? IIRC it went up to 20, but there might've been a gap.
1-12 are configured to not have qemu_i586,vagrant currently.
As the description says, before this ticket we had 16 instances.
Updated by favogt almost 3 years ago
okurz wrote:
favogt wrote:
How many and which workers did ow1 have configured originally? IIRC it went up to 20, but there might've been a gap.
1-12 are configured to not have qemu_i586,vagrant currently.
As the description says
Indeed, totally missed that...
before this ticket we had 16 instances.
Actually, it mentions [7..16] as well as {16..20}. I enabled 1-20 now, let's see. Will be a good test for the new NVMe at least!
Updated by favogt almost 3 years ago
- Copied to action #107017: Random asset download (cache service) failures on openqaworker1 added
Updated by favogt almost 3 years ago
- Has duplicate action #107017: Random asset download (cache service) failures on openqaworker1 added
Updated by favogt almost 3 years ago
As suggested by bmwiedemann, I ran fstrim -v /var/lib/openqa to discard all unused blocks. This took a while but brought (read!) performance back again.
I changed fstrim.timer to run daily by adding /etc/systemd/system/fstrim.timer.d/daily.conf:
[Unit]
Description=Discard unused blocks once a day
[Timer]
OnCalendar=daily
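To make systemd pick up such a drop-in, one would typically reload and restart the timer, roughly like this (a sketch; the drop-in path is the one from the comment above):
systemctl daemon-reload
systemctl restart fstrim.timer
# verify the override is active and see when the next run is scheduled
systemctl cat fstrim.timer
systemctl list-timers fstrim.timer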
Currently the system is running 20 jobs and both hdparm and dd report acceptable speed:
openqaworker1:~ # ps -ef | grep qemu | wc -l
21
openqaworker1:~ # hdparm --direct -t -T /dev/md127
/dev/md127:
Timing O_DIRECT cached reads: 1484 MB in 2.00 seconds = 741.80 MB/sec
Timing O_DIRECT disk reads: 6608 MB in 3.00 seconds = 2202.56 MB/sec
openqaworker1:~ # dd if=/dev/nvme1n1 bs=1M count=1000 status=progress iflag=direct of=/dev/null
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 0.763541 s, 1.4 GB/s
openqaworker1:~ # dd if=/dev/nvme0n1 bs=1M count=1000 status=progress iflag=direct of=/dev/null
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 0.702554 s, 1.5 GB/s
Updated by kraih almost 3 years ago
- Status changed from Resolved to New
This issue is not resolved, as #107017 has shown.
Updated by favogt almost 3 years ago
kraih wrote:
This issue is not resolved, as #107017 has shown.
After running fstrim it's currently "resolved again". Did an error appear after the current boot?
Updated by okurz almost 3 years ago
- Due date changed from 2022-02-11 to 2022-02-24
The performance looks ok
# hdparm --direct -t -T /dev/md127
/dev/md127:
Timing O_DIRECT cached reads: 1560 MB in 2.00 seconds = 779.75 MB/sec
Timing O_DIRECT disk reads: 4470 MB in 3.04 seconds = 1469.18 MB/sec
No related errors in dmesg. 20 jobs running. However I think 20 worker instances is too much. For example there are incompletes like https://openqa.opensuse.org/tests/2200039 with "cache failure: Cache service queue already full (5)", which likely means that we can't run that many right now unless we also tweak the worker cache parameters.
Updated by kraih almost 3 years ago
okurz wrote:
The performance looks ok
# hdparm --direct -t -T /dev/md127
/dev/md127:
Timing O_DIRECT cached reads: 1560 MB in 2.00 seconds = 779.75 MB/sec
Timing O_DIRECT disk reads: 4470 MB in 3.04 seconds = 1469.18 MB/sec
Actually, the machine might be in trouble again. Running hdparm --direct -t -T /dev/md127 hangs now.
Updated by favogt almost 3 years ago
kraih wrote:
okurz wrote:
The performance looks ok
# hdparm --direct -t -T /dev/md127
/dev/md127:
Timing O_DIRECT cached reads: 1560 MB in 2.00 seconds = 779.75 MB/sec
Timing O_DIRECT disk reads: 4470 MB in 3.04 seconds = 1469.18 MB/sec
Actually, the machine might be in trouble again. Running hdparm --direct -t -T /dev/md127 hangs now.
Yup, dd is also slow again. fstrim ran 10h ago, which should be recent enough, though it's theoretically possible that it's been written full during that time period already. If another fstrim helps, the discard FS option is probably worth a try.
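For illustration, a sketch of what enabling discard could look like, assuming the RAID0 layout restored after the NVMe replacement (the fstab line is a hypothetical variant of the entry from the diff earlier in this ticket):
/dev/md/openqa /var/lib/openqa ext4 defaults,discard 0 0
# apply without a reboot
mount -o remount,discard /var/lib/openqa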
Updated by okurz almost 3 years ago
- Subject changed from OW1: performance loss to OW1: performance loss size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler almost 3 years ago
@fvogt says that after another fstrim the performance still looks ok, and he added the discard FS option
Updated by mkittler almost 3 years ago
- Status changed from Workable to Feedback
So, are we good with these measures in place?
Updated by favogt almost 3 years ago
Performance is still looking ok, but there's some random corruption in some downloaded assets, causing e.g. downloaded HDDs to fail booting or upgrading:
https://openqa.opensuse.org/tests/2209518#step/upgrade_select/4
Updated by okurz almost 3 years ago
- Due date deleted (2022-02-24)
- Target version changed from Ready to future
It looks serious enough that we would like to handle this properly.
@favogt I will leave it to you to decide if further actions are needed. I am excluding the ticket from our backlog and removing the due date because we will only react on this on your request. If you can provide good proof that the hardware is at fault we can order a replacement (again).
Updated by favogt almost 3 years ago
- Assignee deleted (favogt)
Except for the occasional bad download every few days, the system appears to work fine to me.
I don't know what causes it and I'm not sure what the best way to find out would be.
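One possible way to narrow it down would be to compare the checksum of a cached asset against a freshly downloaded copy whenever a job hits such corruption. A sketch, where the asset name is purely hypothetical and the cache path is an assumption that needs to be adapted:
ASSET=opensuse-15.3-x86_64-example@64bit.qcow2              # hypothetical asset name, adjust
CACHED="/var/lib/openqa/cache/openqa.opensuse.org/$ASSET"   # assumed worker cache location
curl -fsSL -o "/tmp/$ASSET" "https://openqa.opensuse.org/assets/hdd/$ASSET"
sha256sum "$CACHED" "/tmp/$ASSET"                           # differing sums point at a bad download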
Updated by okurz almost 3 years ago
- Status changed from Feedback to Workable
- Priority changed from Urgent to High
- Target version changed from future to Ready
so back to the team for checking again
Updated by livdywan almost 3 years ago
We do have two NVMe devices, and we should experiment here, such as by trying out the new NVMe and collecting figures from before and after (avoiding jobs that are too old)
Updated by okurz almost 3 years ago
I suggest using https://progress.opensuse.org/projects/openqatests/wiki/Wiki#Statistical-investigation : run some tests without changes, then change a parameter and test again
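A rough sketch of such a run, assuming openqa-clone-job is available; the reference job ID and the build/test names are placeholders to be replaced with a real reference job:
# clone one reference job 20 times into a separate build so pass/fail rates can be compared
for i in $(seq 1 20); do
    openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org 2200039 \
        _GROUP=0 BUILD=poo103524_baseline TEST=raid1_poo103524_$i
done
Repeating the same loop after changing one parameter (e.g. the number of worker instances) gives two comparable sets of results.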
Updated by mkittler over 2 years ago
Unlike what AC1 might let one assume, ow1 is still used in production. In fact, it seems to be the worker processing the most jobs.
It doesn't look like its failure/incomplete rate is significantly worse compared to the other workers:
openqa=> with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed' or result='incomplete') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where t_finished >= '2022-03-04' group by host order by ratio_failed_by_host desc;
host | ratio_failed_by_host | total
-------------------------+----------------------+-------
siodtw01 | 74.89 | 223
openqaworker1_container | 74.88 | 211
openqa-aarch64 | 53.37 | 5134
ip-10-252-32-90 | 37.92 | 654
ip-10-252-32-98 | 36.52 | 2141
power8 | 33.92 | 2621
imagetester | 27.87 | 854
openqaworker1 | 27.22 | 8599
openqaworker4 | 26.64 | 7669
openqaworker7 | 23.89 | 6860
| 1.44 | 2785
(11 rows)
I'm ignoring openqaworker1_container here because that's likely not a representative set of jobs.
Updated by okurz over 2 years ago
- Status changed from Workable to Resolved
- Assignee set to mkittler
mkittler wrote:
Unlike AC1 might let one assume, ow1 is still used in production. In fact, it seems to be the worker processing the most jobs.
That is correct. I wondered if maybe not all worker instances are enabled, but the machine seems to be happily running 20 openQA worker instances, so we are back to the all-time high :)
It doesn't look like its failure/incomplete rate is significantly worse compared to the other workers:
[…]
That looks good. Thanks for crosschecking. I also crosschecked the RAID config and everything looks to be ok. I would say that's enough then. So I am resolving the ticket and assigning it to mkittler as the main person driving this story to completion.