action #53234
all jobs on aarch64.o.o incompleted with "Permission denied" on /dev/hugepages, "others" had no r/w
Added by okurz over 5 years ago. Updated about 1 year ago.
0%
Description
Observation
All jobs on aarch64.o.o went incomplete, e.g. see https://openqa.opensuse.org/tests/961910/file/autoinst-log.txt stating
qemu-system-aarch64: can't open backing store /dev/hugepages/ for guest RAM: Permission denied
The folder /dev/hugepages belonged to root:root and had rwxrwxr-x, so there was no write permission for "others". However, mount reports
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,mode=1777)
so the mode should have been set correctly.
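The mismatch described above (mount option says mode=1777, the directory itself shows rwxrwxr-x) can be checked with a small script. This is only a sketch: the helper names mount_mode and others_can_write are made up for illustration, and it assumes GNU stat as found on the worker hosts.

```shell
#!/bin/sh
# Sketch only: both helper names are hypothetical, not part of openQA.

# Print the mode= value from one line of `mount` output, or nothing if absent.
mount_mode() {
    printf '%s\n' "$1" | sed -n 's/.*[(,]mode=\([0-7]*\)[,)].*/\1/p'
}

# Succeed if "others" have write permission on the given path.
others_can_write() {
    perms=$(stat -c '%a' "$1") || return 2
    # %a prints octal digits; the leading 0 makes the shell read them as octal.
    # Bit value 2 in the last digit is the write bit for "others".
    [ $(( 0$perms & 2 )) -ne 0 ]
}
```

On the worker this could be used as `others_can_write /dev/hugepages || echo "others cannot write"`, next to `mount_mode "$(mount | grep ' /dev/hugepages ')"` to see what the mount options claim.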
Updated by okurz over 5 years ago
- Status changed from New to In Progress
I have worked around this manually with chmod o+w /dev/hugepages/
but I am not yet sure where the original problem comes from. For other reasons (irqbalance segfaults) I installed a package to find "core dumps" and rebooted. Let's see how it looks after the reboot.
Added to /etc/fstab:
hugetlbfs /dev/hugepages hugetlbfs mode=1777,rw,relatime 0 0
Updated by okurz over 5 years ago
- Status changed from In Progress to Feedback
- Priority changed from Normal to Low
No one is aware of changes. I checked snapper diff 277..278 to see if any changes made by the nightly update could explain why the permissions changed, but I could only find an openQA update.
Adding a udev rule with
echo 'KERNEL=="hugepages*", OWNER="root", GROUP="root", MODE="0775"' > /etc/udev/rules.d/52-hugepages.rules
does not help either (checked over a reboot). I deleted both the udev file and the /etc/fstab entry and instead added a workaround in /etc/rc.d/boot.local
# Quote the heredoc delimiter so that $(...) is evaluated at boot, not when writing the file
cat > /etc/rc.d/boot.local << 'EOF'
#!/bin/sh -e
# okurz: 2019-06-18: https://progress.opensuse.org/issues/53234
logger "checking /dev/hugepages permissions in the beginning: $(stat -c%A /dev/hugepages/), see https://progress.opensuse.org/issues/53234 for details"
for i in {1..10} ; do
    # the 9th character of e.g. "drwxrwxr-x" is the write bit for "others"
    if [ "$(stat -c%A /dev/hugepages/ | cut -c9)" != "w" ]; then
        logger "Correcting permissions on /dev/hugepages from /etc/rc.d/boot.local, see https://progress.opensuse.org/issues/53234 for details"
        chmod o+w /dev/hugepages/
        break
    fi
    sleep 1  # assumed pause between checks, not part of the original paste
done
EOF
chmod +x /etc/rc.d/boot.local
that worked, see
# journalctl -u rc-local
-- Logs begin at Tue 2019-06-18 11:14:12 CEST, end at Tue 2019-06-18 11:14:53 CEST. --
Jun 18 11:14:23 openqa-aarch64 systemd[1]: Starting /etc/init.d/boot.local Compatibility...
Jun 18 11:14:23 openqa-aarch64 root[1539]: checking /dev/hugepages permissions in the beginning: drwxrwxrwt, see https://progress.opensuse.org/issues/53234 for details
Jun 18 11:14:28 openqa-aarch64 root[2224]: Correcting permissions on /dev/hugepages from /etc/rc.d/boot.local, see https://progress.opensuse.org/issues/53234 for details
Jun 18 11:14:28 openqa-aarch64 systemd[1]: Started /etc/init.d/boot.local Compatibility.
With the workaround in place we can wait for further insight, as the actual problem is not yet known. Maybe other people have an idea.
To be reviewed after some time.
Updated by okurz over 5 years ago
Now it's getting uglier. https://openqa.opensuse.org/tests/962467/file/autoinst-log.txt just failed "out of nowhere" while the worker host had already been running for multiple hours. It seems something changed during runtime of the host.
Fixed permissions manually and will try to follow tail -f /var/log/audit/audit.log.
2019-06-19: No problem so far this morning, permissions on /dev/hugepages/ are still intact.
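Following audit.log only shows something if the audit subsystem has a rule that records the relevant events. A watch on attribute changes of the directory might make the culprit visible. The following is a sketch under that assumption; the helper name audit_watch_rule and the key hugepages-perm are made up for illustration.

```shell
#!/bin/sh
# Sketch only: helper name and key are hypothetical.

# Print an audit watch rule in auditctl syntax for attribute changes on a path.
#   -w PATH  path to watch
#   -p a     record attribute changes (e.g. chmod, chown)
#   -k KEY   key to search for later
audit_watch_rule() {
    printf '%s\n' "-w $1 -p a -k $2"
}
```

The printed rule could be passed to auditctl directly, or written to a file below /etc/audit/rules.d/ and loaded with augenrules --load. Matching events should then be searchable with ausearch -k hugepages-perm -i, including the executable that changed the mode.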
Updated by okurz over 5 years ago
Replaced /etc/rc.d/boot.local by openqa-hugepages-fix.service from #52850#note-13
Updated by okurz over 5 years ago
aarch64.o.o was down this morning with no reaction on IPMI SOL, so I power cycled it. It booted, but again the permissions on /dev/hugepages/ were not set even though the service ran. Monitoring whether I can see anything later with tail -f /var/log/audit/audit.log | grep hugepages
Updated by okurz over 5 years ago
[21/06/2019 09:55:24] <guillaume_g> okurz: aarch64 worker is broken again due to hugepages permissions: QEMU: qemu-system-aarch64: can't open backing store /dev/hugepages/ for guest RAM: Permission denied
[21/06/2019 09:55:42] <guillaume_g> okurz: could you have a look, please?
[21/06/2019 09:56:03] <okurz> I will
[21/06/2019 09:56:18] <guillaume_g> okurz: thanks. :)
[21/06/2019 09:57:37] <okurz> guillaume_g: do you have experience with setting permissions on /dev/hugepages on other machines? I have a systemd service in place now which triggers but I guess something else is updating permissions on /dev/hugepages in the wrong way and I do not know what thing that could be
I checked the output from the previous command tail -f /var/log/audit/audit.log and there was no output until the connection was reset due to the daily self-update cycle.
I put the correction script /etc/rc.d/boot.local in place as well. Let's see if my observation from #53234#note-5 repeats.
I crosschecked again where "hugepages" is referenced in /etc and found only the services I had put there myself. I checked whether unmounting and remounting breaks or fixes it:
umount /dev/hugepages
mount -a
ls -la /dev/hugepages/
All ok.
Then I looked for other systemd service definitions and found it:
$ find /usr/lib/systemd/ -name '*.service' | xargs grep hugepages
/usr/lib/systemd/system/ovs-vswitchd.service:ExecStartPre=-/bin/sh -c '/usr/bin/chown :$${OVS_USER_ID##*:} /dev/hugepages'
/usr/lib/systemd/system/ovs-vswitchd.service:ExecStartPre=-/usr/bin/chmod 0775 /dev/hugepages
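The search above only covers *.service files below /usr/lib/systemd/. A slightly broader variant that also looks at drop-in *.conf files and takes several directories could look like the following sketch. The helper name units_mentioning is made up for illustration, and GNU find and xargs are assumed.

```shell
#!/bin/sh
# Sketch only: the helper name is hypothetical.

# Print "file:line" hits for a pattern in unit files and drop-ins
# below one or more directories.
units_mentioning() {
    pattern=$1
    shift
    # -print0 / -0 keep unusual file names intact,
    # -r skips grep when find returns nothing,
    # -H always prints the file name in front of the match.
    find "$@" -type f \( -name '*.service' -o -name '*.conf' \) -print0 2>/dev/null |
        xargs -0 -r grep -H -- "$pattern"
}
```

For example: units_mentioning hugepages /usr/lib/systemd /etc/systemd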
I found the upstream commit that introduced that:
https://github.com/openvswitch/ovs/commit/e3e738a3d0580a9a7178adfc9300a193b8df4ae5#diff-d9846707ff4b611f2ef841607aee9861R18
with the funny text "This change may be a bit controversial since it modifies /dev/hugepages as part of starting the ovs-vswitchd to set a hugetlbfs group ownership."
so I deleted /etc/rc.d/boot.local again and updated the "fix" systemd service to wait for ovs-vswitchd.service
--- a/etc/systemd/system/openqa-hugepages-fix.service 2019-06-21 10:48:54.747085115 +0200
+++ b/etc/systemd/system/openqa-hugepages-fix.service 2019-06-21 10:47:40.026825022 +0200
@@ -1,6 +1,7 @@
[Unit]
Description=Systemd service to fix hugepages + qemu ram problems. See https://progress.opensuse.org/issues/53234 for details
After=dev-hugepages.mount
+After=ovs-vswitchd.service
[Service]
Type=simple
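Note that After= only orders the two units when both are started in the same transaction. It does not cause the fix service to run again if ovs-vswitchd.service is restarted later. To double-check which ordering dependencies a unit file declares, a sketch like the following can help; the helper name unit_after is made up for illustration.

```shell
#!/bin/sh
# Sketch only: the helper name is hypothetical.

# Print every unit named in the After= lines of a unit file, one per line.
unit_after() {
    sed -n 's/^After=//p' "$1" | tr ' ' '\n' | sed '/^$/d'
}
```

For the patched file, unit_after /etc/systemd/system/openqa-hugepages-fix.service should list both dev-hugepages.mount and ovs-vswitchd.service. On a running system, systemctl show -p After openqa-hugepages-fix.service shows the effective value, including implicit dependencies.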
Cannot report this on bugzilla.opensuse.org as it is currently down (see #53396)
Updated by okurz over 5 years ago
- Status changed from Feedback to Blocked
Updated by okurz about 5 years ago
- Status changed from Blocked to Resolved
- Target version set to Done
No movement in the bug; remaining tasks are covered in #43934
Updated by favogt about 1 year ago
For openqaworker-arm22 I used a slightly different method which does not need any hacks or workarounds. In /etc/fstab, add:
hugetlbfs /dev/hugepages hugetlbfs defaults,gid=kvm,mode=1775,pagesize=1G
The _openqa-worker user is a member of the supplementary kvm group, so group-writable is enough.
I guess with ovs this will break again if it changes the group. It should probably use ACLs instead...
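The group-based setup depends on the worker user really being in the group that owns /dev/hugepages. A quick check could look like this sketch; the helper name user_in_group is made up for illustration.

```shell
#!/bin/sh
# Sketch only: the helper name is hypothetical.

# Succeed if user $1 is a member of group $2 (primary or supplementary).
user_in_group() {
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qx -- "$2"
}
```

For example: user_in_group _openqa-worker kvm || echo "_openqa-worker is not in kvm". If ovs changes the group of the directory, comparing against stat -c %G /dev/hugepages instead of a fixed group name would show the mismatch.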