action #53234

closed

all jobs on aarch64.o.o incomplete with "Permission denied" on /dev/hugepages, "others" had no write permission

Added by okurz almost 5 years ago. Updated 8 months ago.

Status: Resolved
Priority: Low
Assignee:
Category: -
Target version:
Start date: 2019-06-18
Due date:
% Done: 0%
Estimated time:

Description

Observation

All jobs on aarch64.o.o went incomplete, e.g. see https://openqa.opensuse.org/tests/961910/file/autoinst-log.txt
stating

qemu-system-aarch64: can't open backing store /dev/hugepages/ for guest RAM: Permission denied

The directory /dev/hugepages, owned by root:root, had mode rwxrwxr-x, so there was no write permission for "others",
although mount reports

hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,mode=1777)

so the mode should have been set correctly.
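
The mismatch between the mount option and the actual directory mode can be checked mechanically. The following is only a minimal sketch and not part of the original investigation: the helper names others_can_write and mode_option are made up for illustration, and it assumes the symbolic mode comes from stat -c %A and the option list from findmnt -n -o OPTIONS.

```shell
#!/bin/sh
# Hypothetical helpers, not taken from the ticket.

# others_can_write MODE_STRING
# Succeeds when a symbolic mode as printed by `stat -c %A`
# (e.g. drwxrwxrwt) grants write permission to "others",
# i.e. when its 9th character is "w".
others_can_write() {
    [ "$(printf '%s' "$1" | cut -c9)" = "w" ]
}

# mode_option MOUNT_OPTIONS
# Prints the value of the mode= option from a comma separated
# mount option list such as "rw,relatime,mode=1777".
mode_option() {
    printf '%s\n' "$1" | tr ',' '\n' | sed -n 's/^mode=//p'
}
```

On the worker this could be used as others_can_write "$(stat -c %A /dev/hugepages)" together with mode_option "$(findmnt -n -o OPTIONS /dev/hugepages)". If the first check fails while the second prints 1777, something changed the mode after the filesystem was mounted.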

Actions #1

Updated by okurz almost 5 years ago

  • Status changed from New to In Progress

I have worked around this manually with chmod o+w /dev/hugepages/, but I am not yet sure where the original problem comes from. For unrelated reasons (irqbalance segfaults) I also installed a package to collect core dumps, and then rebooted. Let's see how it looks after the reboot.

Added to /etc/fstab:

hugetlbfs                                  /dev/hugepages          hugetlbfs mode=1777,rw,relatime                      0  0

Actions #2

Updated by okurz almost 5 years ago

didn't help

Actions #4

Updated by okurz almost 5 years ago

  • Status changed from In Progress to Feedback
  • Priority changed from Normal to Low

No one is aware of any changes. I checked snapper diff 277..278 to see whether the nightly update changed anything that could explain the different permissions, but I could only find an openQA update.

Adding a udev rule with

echo 'KERNEL=="hugepages*", OWNER="root", GROUP="root", MODE="0775"' > /etc/udev/rules.d/52-hugepages.rules

also does not help (checked across a reboot). I deleted both the udev rule file and the /etc/fstab entry and instead added a workaround in /etc/rc.d/boot.local:

cat > /etc/rc.d/boot.local << 'EOF'
#!/bin/sh -e
# okurz: 2019-06-18: https://progress.opensuse.org/issues/53234
# The heredoc delimiter is quoted so that $(...) is evaluated at boot time,
# not while this file is being written.
logger "checking /dev/hugepages permissions in the beginning: $(stat -c%A /dev/hugepages/), see https://progress.opensuse.org/issues/53234 for details"
# Poll for about 10 seconds and restore o+w once if it gets removed.
for i in $(seq 10) ; do
    if [ "$(stat -c%A /dev/hugepages/ | cut -c9)" != "w" ]; then
        logger "Correcting permissions on /dev/hugepages from /etc/rc.d/boot.local, see https://progress.opensuse.org/issues/53234 for details"
        chmod o+w /dev/hugepages/
        break
    fi
    sleep 1
done
EOF
chmod +x /etc/rc.d/boot.local

that worked, see

# journalctl -u rc-local
-- Logs begin at Tue 2019-06-18 11:14:12 CEST, end at Tue 2019-06-18 11:14:53 CEST. --
Jun 18 11:14:23 openqa-aarch64 systemd[1]: Starting /etc/init.d/boot.local Compatibility...
Jun 18 11:14:23 openqa-aarch64 root[1539]: checking /dev/hugepages permissions in the beginning: drwxrwxrwt, see https://progress.opensuse.org/issues/53234 for details
Jun 18 11:14:28 openqa-aarch64 root[2224]: Correcting permissions on /dev/hugepages from /etc/rc.d/boot.local, see https://progress.opensuse.org/issues/53234 for details
Jun 18 11:14:28 openqa-aarch64 systemd[1]: Started /etc/init.d/boot.local Compatibility.

With the workaround in place we can wait for further insight; the actual cause is still unknown. Maybe other people have an idea.

To be reviewed after some time.

Actions #5

Updated by okurz almost 5 years ago

Now it's getting uglier. https://openqa.opensuse.org/tests/962467/file/autoinst-log.txt just failed "out of nowhere" after the worker host had already been running for multiple hours. It seems something changed the permissions during the runtime of the host.

Fixed permissions manually and will try to follow tail -f /var/log/audit/audit.log.

2019-06-19: No problem so far this morning, permissions on /dev/hugepages/ are still intact.

Actions #6

Updated by okurz almost 5 years ago

Replaced /etc/rc.d/boot.local by openqa-hugepages-fix.service from #52850#note-13

Actions #8

Updated by okurz almost 5 years ago

aarch64.o.o was down this morning with no reaction on IPMI SOL, so I power cycled it. It booted, but again the permissions on /dev/hugepages/ were not set even though the service ran. I am monitoring whether anything shows up later with tail -f /var/log/audit/audit.log | grep hugepages

Actions #9

Updated by okurz almost 5 years ago

[21/06/2019 09:55:24] <guillaume_g> okurz: aarch64 worker is broken again due to hugepages permissions: QEMU: qemu-system-aarch64: can't open backing store /dev/hugepages/ for guest RAM: Permission denied
[21/06/2019 09:55:42] <guillaume_g> okurz: could you have a look, please?
[21/06/2019 09:56:03] <okurz> I will
[21/06/2019 09:56:18] <guillaume_g> okurz: thanks. :)
[21/06/2019 09:57:37] <okurz> guillaume_g: do you have experience with setting permissions on /dev/hugepages on other machines? I have a systemd service in place now which triggers but I guess something else is updating permissions on /dev/hugepages in the wrong way and I do not know what thing that could be

I checked the output from the previous command tail -f /var/log/audit/audit.log and there was no output until the connection was reset due to the daily self-update cycle.
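
One possible reason for the empty output, stated here as an assumption because the ticket does not say how auditd was configured: the audit log normally only records file attribute changes for paths that have a watch rule. The sketch below shows how such a watch could be added with auditctl and queried with ausearch. The function names, the DRY_RUN switch, and the key hugepages-perm are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers, not taken from the ticket. Needs root unless DRY_RUN=1.

# run CMD [ARGS...]
# Executes the command, or only prints it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Add an audit watch on /dev/hugepages.
#   -w PATH  path to watch
#   -p a     record attribute changes (this covers chmod and chown)
#   -k KEY   arbitrary key to find the events again later
watch_hugepages() {
    run auditctl -w /dev/hugepages -p a -k hugepages-perm
}

# Show the recorded events for the key in interpreted form,
# including the executable that made the change.
show_hugepages_events() {
    run ausearch -k hugepages-perm -i
}
```

Calling DRY_RUN=1 watch_hugepages only prints the auditctl command, which is a safe way to review it before running it as root.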

I put the correction script /etc/rc.d/boot.local in place as well. Let's see if my observation from #53234#note-5 repeats.

I crosschecked again where "hugepages" is referenced in /etc and found only the services I had added myself. I then checked whether unmounting and remounting breaks or fixes it:

umount /dev/hugepages 
mount -a
ls -la /dev/hugepages/

all ok.

then I looked for other systemd service definitions and found it:

$ find /usr/lib/systemd/ -name '*.service' | xargs grep hugepages
/usr/lib/systemd/system/ovs-vswitchd.service:ExecStartPre=-/bin/sh -c '/usr/bin/chown :$${OVS_USER_ID##*:} /dev/hugepages'
/usr/lib/systemd/system/ovs-vswitchd.service:ExecStartPre=-/usr/bin/chmod 0775 /dev/hugepages

I found the upstream commit that introduced that:
https://github.com/openvswitch/ovs/commit/e3e738a3d0580a9a7178adfc9300a193b8df4ae5#diff-d9846707ff4b611f2ef841607aee9861R18

with the funny text "This change may be a bit controversial since it modifies /dev/hugepages as part of starting the ovs-vswitchd to set a hugetlbfs group ownership."
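
The search that uncovered ovs-vswitchd.service can be generalised to several unit directories and narrowed to Exec* directives, since those are the lines that actually run commands. This is a minimal sketch; the function name units_touching is made up for illustration, and the pattern is used as an extended regular expression, so paths with special characters would need escaping.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the ticket.

# units_touching PATTERN DIR...
# Prints "file:line" for every Exec* directive (ExecStart, ExecStartPre, ...)
# in *.service files below the given directories whose command mentions PATTERN.
# Returns non-zero when nothing matches.
units_touching() {
    pattern=$1
    shift
    find "$@" -name '*.service' \
        -exec grep -HE "^Exec[A-Za-z]*=.*${pattern}" {} +
}
```

For example, units_touching /dev/hugepages /usr/lib/systemd /etc/systemd would cover both the packaged and the locally added units in one go.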

so I deleted /etc/rc.d/boot.local again and updated the "fix" systemd service to wait for ovs-vswitchd.service

--- a/etc/systemd/system/openqa-hugepages-fix.service   2019-06-21 10:48:54.747085115 +0200
+++ b/etc/systemd/system/openqa-hugepages-fix.service    2019-06-21 10:47:40.026825022 +0200
@@ -1,6 +1,7 @@
 [Unit]
 Description=Systemd service to fix hugepages + qemu ram problems. See https://progress.opensuse.org/issues/53234 for details
 After=dev-hugepages.mount
+After=ovs-vswitchd.service

 [Service]
 Type=simple
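
After= only affects ordering, so the fix service has to be started after the ExecStartPre commands of ovs-vswitchd.service have run for the workaround to win. A quick way to confirm that a unit file really declares the ordering is sketched below. The function name unit_ordered_after is made up for illustration, and it only inspects the given file, not drop-ins or the effective configuration.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the ticket.

# unit_ordered_after UNIT_FILE DEPENDENCY
# Succeeds when UNIT_FILE contains an After= line listing DEPENDENCY.
# After= may appear several times and may list several units separated
# by spaces, so every entry is compared on its own line.
unit_ordered_after() {
    unit_file=$1
    dep=$2
    sed -n 's/^After=//p' "$unit_file" | tr ' ' '\n' | grep -qxF "$dep"
}
```

On the worker this would be called as unit_ordered_after /etc/systemd/system/openqa-hugepages-fix.service ovs-vswitchd.service. The effective ordering as systemd sees it can be inspected with systemctl show -p After openqa-hugepages-fix.service.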

Cannot report this on bugzilla.opensuse.org as it is currently down (see #53396)

Actions #10

Updated by okurz almost 5 years ago

  • Status changed from Feedback to Blocked

Actions #11

Updated by okurz over 4 years ago

  • Status changed from Blocked to Resolved
  • Target version set to Done

no movement in bug, remaining tasks covered in #43934

Actions #12

Updated by favogt 8 months ago

For openqaworker-arm22 I used a slightly different method which does not need any hacks or workarounds. In /etc/fstab, add:

hugetlbfs /dev/hugepages hugetlbfs defaults,gid=kvm,mode=1775,pagesize=1G

The _openqa-worker user is a member of the supplementary kvm group, so group-writable is enough.

I guess with ovs this will break again if it changes the group. It should probably use ACLs instead...
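
The group-based setup depends on two things holding at the same time: the directory is group-writable, and the worker user is in the owning group. A minimal sketch of both checks follows. The function names group_can_write and in_group are made up for illustration, and the mode string is assumed to come from stat -c %A.

```shell
#!/bin/sh
# Hypothetical helpers, not taken from the ticket.

# group_can_write MODE_STRING
# Succeeds when a symbolic mode as printed by `stat -c %A`
# grants write permission to the group (6th character is "w").
group_can_write() {
    [ "$(printf '%s' "$1" | cut -c6)" = "w" ]
}

# in_group WANTED GROUP...
# Succeeds when WANTED is one of the listed group names.
in_group() {
    wanted=$1
    shift
    for g in "$@"; do
        if [ "$g" = "$wanted" ]; then
            return 0
        fi
    done
    return 1
}
```

A possible call on the worker would be group_can_write "$(stat -c %A /dev/hugepages)" && in_group "$(stat -c %G /dev/hugepages)" $(id -nG _openqa-worker). If ovs-vswitchd changes the group of /dev/hugepages, the second check would start failing.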
