action #162494
closed
telegraf error on some OSD controlled machines "W! [inputs.diskio] Error gathering disk info: no such file or directory" size:S
Added by okurz 10 months ago.
Updated about 1 month ago.
Category:
Regressions/Crashes
Description
Observation
sudo salt \* cmd.run 'journalctl -u telegraf | grep -c "inputs\.diskio"'
openqaworker18.qa.suse.cz:
0
backup-qam.qe.nue2.suse.org:
0
…
worker40.oqa.prg2.suse.org:
1896
…
worker33.oqa.prg2.suse.org:
4162
This is just an example, not the complete output.
Acceptance criteria
- AC1: No errors or warnings related to gathering disk info in telegraf journal
Acceptance tests
- AT1-1:
sudo salt \* cmd.run 'journalctl -u telegraf | grep -c "inputs\.diskio"'
is 0
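A minimal sketch for evaluating AT1-1 without reading the whole salt output by eye. It assumes the default text output shown in the observation (a host line ending in a colon, followed by a line holding the count). The function name `hosts_with_diskio_errors` is made up for illustration.

```shell
# Print "host count" for every minion whose count is greater than zero.
# Assumes salt's default text output: "hostname:" followed by a count line.
hosts_with_diskio_errors() {
    awk '
        # A non-indented line ending in ":" names the minion.
        /^[^[:space:]].*:$/ { host = substr($0, 1, length($0) - 1); next }
        # A line holding only a number is the grep -c result for that minion.
        /^[[:space:]]*[0-9]+[[:space:]]*$/ { if ($1 + 0 > 0) print host, $1 + 0 }
    '
}
```

Possible usage: pipe the command from AT1-1 into `hosts_with_diskio_errors`. Empty output would then mean every reporting minion returned 0.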
Suggestions
- web research for
W! [inputs.diskio] Error gathering disk info: no such file or directory
and try telegraf -test
on machines to reproduce. Probably we need to exclude something from looking up disks that don't exist?
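One possible way to reproduce on a single machine, as a sketch: run only the diskio input once and keep the warning and error lines. It assumes telegraf picks up its default configuration and that log lines carry the `W!` or `E!` prefix seen in the journal. The function name `diskio_test_warnings` is hypothetical.

```shell
# Run the diskio input once and show only its warnings and errors.
# telegraf writes log lines to stderr, so both streams are merged first.
diskio_test_warnings() {
    telegraf --test --input-filter diskio 2>&1 \
        | grep -E '[WE]! \[inputs\.diskio\]'
}
```

If the function prints nothing and returns non-zero, no diskio warning was emitted during that run.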
- Subject changed from telegraf error on some OSD controlled machines "W! [inputs.diskio] Error gathering disk info: no such file or directory" to telegraf error on some OSD controlled machines "W! [inputs.diskio] Error gathering disk info: no such file or directory" size:S
- Description updated (diff)
- Status changed from New to Workable
- Target version changed from Tools - Next to Ready
- Target version changed from Ready to Tools - Next
- Target version changed from Tools - Next to Ready
- Assignee set to jbaier_cz
- Status changed from Workable to In Progress
- Status changed from In Progress to Feedback
It may be worth noting that inputs.diskio.tagdrop is useless in this case, because the attempt to get info from /dev has already been made by that point.
Specifying the devices explicitly instead of relying on the implicit devices=["*"] might work, but then we risk not collecting everything we want or need.
I've taken worker33 as an example because you mentioned it as one of the machines causing problems, and indeed I found the device node you mentioned:
worker33:~ # ls -lah /sys/class/block/nvme?c*
lrwxrwxrwx 1 root root 0 Feb 23 03:34 /sys/class/block/nvme1c1n1 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.0/nvme/nvme1/nvme1c1n1
Unfortunately, I couldn't figure out what exactly in udev causes this node to be created. I checked what we have on worker33 against the kernel docs to see whether we would lose any functionality. Each of the 3 NVMes in worker33 has just one namespace, so there can't be any multipathing here ("The NVMe multipath feature in Linux integrates namespaces with the same identifier into a single block device."). The currently used policy is "NUMA", which prefers shorter paths on multi-CPU machines ("The NUMA policy selects the path closest to the NUMA node of the current CPU"), and at least worker33 is not such a machine. I think this is why we can go ahead with the proposed change for now, but I can imagine several options to look into:
- Check out if the nvme-tool has a way to disable this feature on the disk itself so udev does not create this node
- Dig into udev if there is a way to avoid creating these nodes by e.g. setting a flag/env-variable (/usr/lib/udev/rules.d/56-multipath.rules could be interesting)
- Understand why telegraf considers them with devices = ["*"] - what does this *-wildcard mean? Can it be influenced?
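To check the multipath state and path policy discussed above on a given worker, something like the following could be used. It is a sketch that assumes the kernel exposes `module/nvme_core/parameters/multipath` and `class/nvme-subsystem/*/iopolicy` under sysfs. The ticket does not confirm these paths for the workers, and the function name `show_nvme_multipath_state` is made up.

```shell
# Print the native NVMe multipath setting and the I/O policy per subsystem.
# The sysfs root is a parameter so the function can be tried on a test tree.
show_nvme_multipath_state() {
    sysfs_root="${1:-/sys}"
    param="$sysfs_root/module/nvme_core/parameters/multipath"
    # Assumed location of the module parameter; skipped when not readable.
    if [ -r "$param" ]; then
        printf 'nvme_core.multipath=%s\n' "$(cat "$param")"
    fi
    for policy in "$sysfs_root"/class/nvme-subsystem/*/iopolicy; do
        # Skip the unexpanded glob when no subsystem is present.
        [ -r "$policy" ] || continue
        subsys="${policy%/iopolicy}"
        printf '%s iopolicy=%s\n' "${subsys##*/}" "$(cat "$policy")"
    done
}
```

Called without arguments it reads from `/sys`, so running it before and after a configuration change would show whether anything actually changed.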
nicksinger wrote in #note-10:
- Check out if the nvme-tool has a way to disable this feature on the disk itself so udev does not create this node
I wasn't even able to find a way to list those, but I haven't read much of the documentation.
- Dig into udev if there is a way to avoid creating these nodes by e.g. setting a flag/env-variable (/usr/lib/udev/rules.d/56-multipath.rules could be interesting)
Good idea. That would probably be more elegant than disabling it directly in the kernel.
- Understand why telegraf considers them with devices = ["*"] - what does this *-wildcard mean? Can it be influenced?
I can answer that right away: telegraf scans /sys/block (a newer version will read /sys/class/block) for devices matching the mask. Every entry there is treated as a block device and used. Unfortunately, as we can see, not every entry there has a corresponding /dev entry, and that produces the warning. To me it looks like a bug in telegraf, but I am not enough of an expert in this area to decide that.
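The mismatch described here can be made visible with a small sketch that lists sysfs block entries lacking a node of the same name under /dev. The function name `list_block_entries_without_devnode` is hypothetical, and checking for a same-named node is a simplification of whatever telegraf does internally.

```shell
# List entries of a sysfs block directory that have no same-named node in /dev.
# Both directories are parameters so the function can be tried on test trees.
list_block_entries_without_devnode() {
    sys_dir="${1:-/sys/class/block}"
    dev_dir="${2:-/dev}"
    for entry in "$sys_dir"/*; do
        # Skip the unexpanded glob when the directory is empty.
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        name="${entry##*/}"
        # Report the entry when /dev holds nothing under that name.
        [ -e "$dev_dir/$name" ] || printf '%s\n' "$name"
    done
}
```

On an affected worker, calling it without arguments would be expected to print entries such as the nvme1c1n1 node shown earlier. Empty output means every entry has a /dev counterpart.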
- Status changed from Feedback to In Progress
- Status changed from In Progress to Feedback
I did a test on worker39. Below is the udev log for the nvme0c0n1 device. To me it looks like it is added as part of the standard nvme handling.
udevadm test /sys/class/block/nvme0c0n1
...
nvme0c0n1: /usr/lib/udev/rules.d/56-multipath.rules:32 Importing properties from results of '/sbin/multipath -u nvme0c0n1'
nvme0c0n1: Starting '/sbin/multipath -u nvme0c0n1'
Successfully forked off '(spawn)' as PID 118661.
nvme0c0n1: Process '/sbin/multipath -u nvme0c0n1' failed with exit code 1.
nvme0c0n1: /usr/lib/udev/rules.d/56-multipath.rules:32 Command "/sbin/multipath -u nvme0c0n1" returned 1 (error), ignoring
nvme0c0n1: /usr/lib/udev/rules.d/60-persistent-storage.rules:51 Replaced 1 slash(es) from result of ENV{ID_SERIAL}="$env{ID_MODEL}_$env{ID_SERIAL_SHORT}"
nvme0c0n1: /usr/lib/udev/rules.d/60-persistent-storage.rules:53 Replaced 1 slash(es) from result of ENV{ID_SERIAL}="$env{ID_MODEL}_$env{ID_SERIAL_SHORT}_$env{ID_NSID}"
nvme0c0n1: /usr/lib/udev/rules.d/60-persistent-storage.rules:110 Importing properties from results of builtin command 'path_id'
nvme0c0n1: /usr/lib/udev/rules.d/60-persistent-storage.rules:133 Importing properties from results of builtin command 'blkid'
nvme0c0n1: Failed to get device name: No such file or directory
nvme0c0n1: /usr/lib/udev/rules.d/60-persistent-storage.rules:133 Failed to run builtin 'blkid': No such file or directory
nvme0c0n1: /usr/lib/udev/rules.d/61-persistent-storage-compat.rules:48 Importing properties from '/usr/lib/udev/compat-symlink-generation'
nvme0c0n1: /usr/lib/udev/rules.d/90-iocost.rules:18 Importing properties from results of builtin command 'hwdb 'block::name:SAMSUNG MZPLJ6T4HALA-00007:fwrev:EPK9CB5Q:''
nvme0c0n1: No entry found from hwdb.
nvme0c0n1: /usr/lib/udev/rules.d/90-iocost.rules:18 Failed to run builtin 'hwdb 'block::name:SAMSUNG MZPLJ6T4HALA-00007:fwrev:EPK9CB5Q:'': No data available
nvme0c0n1: sd-device: Created db file '/run/udev/data/+block:nvme0c0n1' for '/devices/pci0000:80/0000:80:01.1/0000:81:00.0/nvme/nvme0/nvme0c0n1'
DEVPATH=/devices/pci0000:80/0000:80:01.1/0000:81:00.0/nvme/nvme0/nvme0c0n1
DEVTYPE=disk
DISKSEQ=1
ACTION=add
SUBSYSTEM=block
.SAVED_FM_WAIT_UNTIL=
ID_SERIAL_SHORT=S55KNC0TA00631
ID_WWN=eui.35354b3054a006310025384300000002
ID_MODEL=SAMSUNG MZPLJ6T4HALA-00007
ID_REVISION=EPK9CB5Q
ID_NSID=1
ID_SERIAL=SAMSUNG_MZPLJ6T4HALA-00007_S55KNC0TA00631_1
ID_PATH=pci-0000:81:00.0-nvme-1
ID_PATH_TAG=pci-0000_81_00_0-nvme-1
COMPAT_SYMLINK_GENERATION=2
.MODEL=SAMSUNG MZPLJ6T4HALA-00007
TAGS=:systemd:
CURRENT_TAGS=:systemd:
USEC_INITIALIZED=734347554766
Nevertheless, I added the kernel parameter to the grub configuration, and after the reboot only devices that telegraf can handle are present.
- Status changed from Feedback to Resolved
MR merged and effective
sudo salt -t 10 \* cmd.run 'journalctl -u telegraf | grep "inputs\.diskio" | grep -c 2025-03-05'
is clean so I assume we are good