action #162494: telegraf error on some OSD controlled machines "W! [inputs.diskio] Error gathering disk info: no such file or directory" size:S - openQA Infrastructure (public) - openSUSE Project Management Tool

#1

Updated by okurz 8 months ago

Subject changed from telegraf error on some OSD controlled machines "W! [inputs.diskio] Error gathering disk info: no such file or directory" to telegraf error on some OSD controlled machines "W! [inputs.diskio] Error gathering disk info: no such file or directory" size:S
Description updated (diff)
Status changed from New to Workable

#2

Updated by tinita 7 months ago

Target version changed from Tools - Next to Ready

#3

Updated by okurz 7 months ago

Target version changed from Ready to Tools - Next

#4

Updated by okurz about 1 month ago

Target version changed from Tools - Next to Ready

#5

Updated by jbaier_cz about 1 month ago

Assignee set to jbaier_cz

#6

Updated by jbaier_cz about 1 month ago

Status changed from Workable to In Progress

#7

Updated by jbaier_cz about 1 month ago

Meh, the warning comes from telegraf trying to read a non-existent /dev/nvme0c0n1 (because /sys/block/nvme0c0n1 is detected). That entry is an NVMe multipath path device, which is not supposed to have a corresponding dev node, so it looks like the telegraf diskio plugin cannot handle that nicely.

I was able to find some similar problems in other software[^3].

It looks like we do have that feature enabled [^2] (it is probably relevant only for some disks, usually the larger ones), and it is probably also enabled by default [^1]:

#  cat /sys/module/nvme_core/parameters/multipath
Y

As far as I know we do not really use it, so one way out could be to simply disable it. According to [^1] that should only take a kernel parameter.
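The check above can be wrapped in a small helper for use on several workers. This is only a sketch: the function name `nvme_multipath_state` is made up for illustration, and the parameter file path is taken from the command shown above. It accepts an alternative path as its first argument so it can be tried against a copy of the file.

```shell
# Hypothetical helper: report whether native NVMe multipath is enabled.
# Reads the nvme_core module parameter shown above; an alternative file
# can be passed as $1.
nvme_multipath_state() {
    param_file="${1:-/sys/module/nvme_core/parameters/multipath}"
    # Module not loaded or file not readable: nothing to report.
    if [ ! -r "$param_file" ]; then
        echo "unknown"
        return 0
    fi
    case "$(cat "$param_file")" in
        Y|1) echo "enabled" ;;
        N|0) echo "disabled" ;;
        *)   echo "unknown" ;;
    esac
}
```

Calling `nvme_multipath_state` without arguments on an affected worker should print `enabled` as long as the parameter file contains `Y`.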

[^1]: https://documentation.suse.com/sles/15-SP6/html/SLES-all/cha-nvmeof.html#sec-nvmeof-host-configuration-multipathing
[^2]: https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/8/html/configuring_device_mapper_multipath/enabling-multipathing-on-nvme-devices_configuring-device-mapper-multipath#proc_enabling-native-nvme-multipathing_enabling-multipathing-on-nvme-devices
[^3]: https://github.com/google/cadvisor/issues/3340

#8

Updated by jbaier_cz about 1 month ago

Status changed from In Progress to Feedback

#9

Updated by jbaier_cz about 1 month ago

#10

Updated by nicksinger about 1 month ago

I've taken worker33 as an example because you mention it as one of the machines causing problems, and indeed I found the device node you mentioned:

worker33:~ # ls -lah /sys/class/block/nvme?c*
lrwxrwxrwx 1 root root 0 Feb 23 03:34 /sys/class/block/nvme1c1n1 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.0/nvme/nvme1/nvme1c1n1

Unfortunately I couldn't figure out what exactly in udev causes this node to be created. I checked what we have on worker33 against the kernel docs to see if we would lose any functionality. All 3 NVMes in worker33 have just one namespace, so there can't be any multipathing here ("The NVMe multipath feature in Linux integrates namespaces with the same identifier into a single block device."). The currently used policy is "NUMA", which prefers shorter paths in multi-CPU machines ("The NUMA policy selects the path closest to the NUMA node of the current CPU"), and at least worker33 is not such a machine. I think this is why we can go ahead with the proposed change for now, but I can imagine several options to look into:

Check out if the nvme-tool has a way to disable this feature on the disk itself so udev does not create this node
Dig into udev if there is a way to avoid creating these nodes by e.g. setting a flag/env-variable (/usr/lib/udev/rules.d/56-multipath.rules could be interesting)
Understand why telegraf considers them with devices = ["*"] - what does this *-wildcard mean? Can it be influenced?
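To see which entries would trip up a tool that maps sysfs block names to /dev nodes, something like the following could be run on a worker. It is a minimal sketch, not what telegraf does internally: the function name `list_block_without_devnode` is hypothetical, and the two directories default to /sys/block and /dev but can be overridden for testing.

```shell
# Hypothetical helper: print every block device name that exists under
# sysfs but has no matching node under /dev.
# $1 = sysfs block directory (default /sys/block)
# $2 = device node directory (default /dev)
list_block_without_devnode() {
    sys_dir="${1:-/sys/block}"
    dev_dir="${2:-/dev}"
    for entry in "$sys_dir"/*; do
        # Skip the literal pattern if the glob matched nothing.
        [ -e "$entry" ] || continue
        name=$(basename "$entry")
        # Report names that have no counterpart in the device directory.
        [ -e "$dev_dir/$name" ] || echo "$name"
    done
}
```

On a machine showing the warning, the output would be expected to include names such as nvme0c0n1; an empty output means every sysfs block entry has a device node.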

#11

Updated by jbaier_cz about 1 month ago · Edited

#12

Updated by jbaier_cz about 1 month ago

Status changed from Feedback to In Progress

#13

Updated by jbaier_cz about 1 month ago

Status changed from In Progress to Feedback

I did a test on worker39. Below is the log from udev about the nvme0c0n1 device. To me it looks like it is added as part of the standard nvme handling.

udevadm test /sys/class/block/nvme0c0n1
...
nvme0c0n1: /usr/lib/udev/rules.d/56-multipath.rules:32 Importing properties from results of '/sbin/multipath -u nvme0c0n1'
nvme0c0n1: Starting '/sbin/multipath -u nvme0c0n1'
Successfully forked off '(spawn)' as PID 118661.
nvme0c0n1: Process '/sbin/multipath -u nvme0c0n1' failed with exit code 1.
nvme0c0n1: /usr/lib/udev/rules.d/56-multipath.rules:32 Command "/sbin/multipath -u nvme0c0n1" returned 1 (error), ignoring
nvme0c0n1: /usr/lib/udev/rules.d/60-persistent-storage.rules:51 Replaced 1 slash(es) from result of ENV{ID_SERIAL}="$env{ID_MODEL}_$env{ID_SERIAL_SHORT}"
nvme0c0n1: /usr/lib/udev/rules.d/60-persistent-storage.rules:53 Replaced 1 slash(es) from result of ENV{ID_SERIAL}="$env{ID_MODEL}_$env{ID_SERIAL_SHORT}_$env{ID_NSID}"
nvme0c0n1: /usr/lib/udev/rules.d/60-persistent-storage.rules:110 Importing properties from results of builtin command 'path_id'
nvme0c0n1: /usr/lib/udev/rules.d/60-persistent-storage.rules:133 Importing properties from results of builtin command 'blkid'
nvme0c0n1: Failed to get device name: No such file or directory
nvme0c0n1: /usr/lib/udev/rules.d/60-persistent-storage.rules:133 Failed to run builtin 'blkid': No such file or directory
nvme0c0n1: /usr/lib/udev/rules.d/61-persistent-storage-compat.rules:48 Importing properties from '/usr/lib/udev/compat-symlink-generation'
nvme0c0n1: /usr/lib/udev/rules.d/90-iocost.rules:18 Importing properties from results of builtin command 'hwdb 'block::name:SAMSUNG MZPLJ6T4HALA-00007:fwrev:EPK9CB5Q:''
nvme0c0n1: No entry found from hwdb.
nvme0c0n1: /usr/lib/udev/rules.d/90-iocost.rules:18 Failed to run builtin 'hwdb 'block::name:SAMSUNG MZPLJ6T4HALA-00007:fwrev:EPK9CB5Q:'': No data available
nvme0c0n1: sd-device: Created db file '/run/udev/data/+block:nvme0c0n1' for '/devices/pci0000:80/0000:80:01.1/0000:81:00.0/nvme/nvme0/nvme0c0n1'
DEVPATH=/devices/pci0000:80/0000:80:01.1/0000:81:00.0/nvme/nvme0/nvme0c0n1
DEVTYPE=disk
DISKSEQ=1
ACTION=add
SUBSYSTEM=block
.SAVED_FM_WAIT_UNTIL=
ID_SERIAL_SHORT=S55KNC0TA00631
ID_WWN=eui.35354b3054a006310025384300000002
ID_MODEL=SAMSUNG MZPLJ6T4HALA-00007
ID_REVISION=EPK9CB5Q
ID_NSID=1
ID_SERIAL=SAMSUNG_MZPLJ6T4HALA-00007_S55KNC0TA00631_1
ID_PATH=pci-0000:81:00.0-nvme-1
ID_PATH_TAG=pci-0000_81_00_0-nvme-1
COMPAT_SYMLINK_GENERATION=2
.MODEL=SAMSUNG MZPLJ6T4HALA-00007
TAGS=:systemd:
CURRENT_TAGS=:systemd:
USEC_INITIALIZED=734347554766

Nevertheless, I added the kernel parameter to the grub configuration and after the reboot there are only devices which telegraf can handle.
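A quick way to confirm after the reboot that the parameter is active is to look at the kernel command line. The sketch below assumes the parameter is spelled `nvme_core.multipath=N` (or the equivalent `nvme-core.multipath=0`); the ticket does not state the exact spelling used, so check [^1] and the actual grub configuration. The function name `cmdline_disables_nvme_multipath` is hypothetical.

```shell
# Hypothetical helper: succeed if the kernel command line disables native
# NVMe multipath. The parameter spelling is an assumption (see above).
# $1 = command line file (default /proc/cmdline)
cmdline_disables_nvme_multipath() {
    cmdline_file="${1:-/proc/cmdline}"
    # Put one parameter per line, then match the whole word; dash and
    # underscore are both accepted in the module name.
    tr ' ' '\n' < "$cmdline_file" \
        | grep -Eq '^nvme[_-]core\.multipath=[Nn0]$'
}
```

Usage would be `cmdline_disables_nvme_multipath && echo "multipath disabled on cmdline"`, ideally combined with a look at /sys/module/nvme_core/parameters/multipath, which should then show `N`.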

#14

Updated by okurz about 1 month ago

Status changed from Feedback to Resolved


