action #162365 (closed)

OSD can fail on xfs_repair OOM conditions size:S

Added by okurz about 1 month ago. Updated 21 days ago.

Status: Resolved
Priority: High
Assignee: jbaier_cz
Category: Feature requests
Target version: -
Start date: 2024-06-17
Due date: -
% Done: 0%
Estimated time: -
Tags: -

Description

Motivation

#162332-7 Research for the xfs OOM issues and try to come up with a way so that the system triggered filesystem checks don't fail on OOM

Suggestions

  • Consider systemd resource limits to avoid the OOM condition, e.g. 75% memory (sketched below)
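
The drop-in below sketches that suggestion, assuming the checks keep being launched via systemd-fsck@.service; the file name and the concrete values are illustrative assumptions, not a tested configuration. Note the limit applies per instance, so with two checks in parallel a lower value (or a limit on the system-systemd\x2dfsck.slice seen in the OOM log) would be needed to bound the total:

mkdir -p /etc/systemd/system/systemd-fsck@.service.d
cat > /etc/systemd/system/systemd-fsck@.service.d/memory-limit.conf <<'EOF'
[Service]
# Both directives accept percentages of physical RAM; MemoryHigh
# throttles and reclaims before the hard MemoryMax cap kills anything
MemoryHigh=65%
MemoryMax=75%
EOF
systemctl daemon-reload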

Related issues 2 (0 open, 2 closed)

Copied from openQA Infrastructure - action #162362: 2024-06-15 osd not accessible - ensure healthy filesystems size:S (Resolved, okurz, 2024-06-17)

Copied to openQA Infrastructure - action #162590: NFS mounts are stuck on OSD workers if partitions on OSD fail to come up properly on boot size:S (Resolved, okurz, 2024-06-17)

Actions #1

Updated by okurz about 1 month ago

  • Copied from action #162362: 2024-06-15 osd not accessible - ensure healthy filesystems size:S added
Actions #2

Updated by jbaier_cz 30 days ago

  • Assignee set to jbaier_cz
Actions #3

Updated by jbaier_cz 30 days ago

From what I could find out, the on-boot checks are triggered by systemd-fsck@.service, which simply invokes fsck.xfs for the problematic filesystems. According to the man page, however:

XFS is a journaling filesystem and performs recovery at mount(8) time if necessary, so fsck.xfs simply exits with a zero exit status. However, the system administrator can force fsck.xfs to run xfs_repair(8) at boot time by creating a /forcefsck file or booting the system with "fsck.mode=force" on the kernel command line.
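
Concretely, the two documented triggers would look like this (a sketch; either one takes effect on the next boot only):

touch /forcefsck    # legacy flag file on the root filesystem
# ...or append "fsck.mode=force" to the kernel command line for a
# single boot, e.g. by editing the boot entry at the GRUB menu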

So there are no on-boot checks for our xfs filesystems unless someone asks for them explicitly. That was probably the case this time, as suggested by the log:

Jun 16 10:46:51 openqa systemd[1]: Starting File System Check on /dev/disk/by-uuid/b5377fcf-6273-4f38-a471-cefff08c60b7...
Jun 16 10:46:51 openqa systemd[1]: Starting File System Check on /dev/disk/by-uuid/0b735e2d-b300-4a37-b39b-65fdc41d1ad1...
Jun 16 10:46:51 openqa systemd[1]: Starting File System Check on /dev/disk/by-uuid/40c5b7dd-487f-442b-ba95-b6186b18f3a7...
Jun 16 10:46:51 openqa systemd-fsck[715]: Please pass 'fsck.mode=force' on the kernel command line rather than creating /forcefsck on the root file system.
Jun 16 10:46:51 openqa systemd[1]: Starting File System Check on /dev/disk/by-uuid/37972261-7411-44eb-bae9-90d1cc1c9a38...
Jun 16 10:46:51 openqa systemd-fsck[717]: Please pass 'fsck.mode=force' on the kernel command line rather than creating /forcefsck on the root file system.
Jun 16 10:46:51 openqa systemd-fsck[720]: Please pass 'fsck.mode=force' on the kernel command line rather than creating /forcefsck on the root file system.
Jun 16 10:46:51 openqa systemd-fsck[722]: Please pass 'fsck.mode=force' on the kernel command line rather than creating /forcefsck on the root file system.
Jun 16 10:46:51 openqa systemd-fsck[735]: /sbin/fsck.xfs error: The filesystem log is dirty, mount it to recover the log. If that fails, refer to the section DIRTY LOGS in the xfs_repair manual page.
Jun 16 10:46:51 openqa systemd-fsck[717]: fsck failed with exit status 4.
Jun 16 10:46:51 openqa systemd-fsck[717]: Running request emergency.target/start/replace
Jun 16 10:46:51 openqa systemd[1]: systemd-fsck@dev-disk-by\x2duuid-0b735e2d\x2db300\x2d4a37\x2db39b\x2d65fdc41d1ad1.service: Main process exited, code=exited, status=1/FAILURE
Jun 16 10:46:51 openqa systemd[1]: systemd-fsck@dev-disk-by\x2duuid-0b735e2d\x2db300\x2d4a37\x2db39b\x2d65fdc41d1ad1.service: Failed with result 'exit-code'.
Jun 16 10:46:51 openqa systemd[1]: Failed to start File System Check on /dev/disk/by-uuid/0b735e2d-b300-4a37-b39b-65fdc41d1ad1.
Jun 16 10:46:51 openqa systemd[1]: Dependency failed for /space-slow.
Jun 16 10:46:51 openqa systemd[1]: Dependency failed for Local File Systems.
Jun 16 10:46:51 openqa systemd[1]: local-fs.target: Job local-fs.target/start failed with result 'dependency'.
Jun 16 10:46:51 openqa systemd[1]: local-fs.target: Triggering OnFailure= dependencies.
Jun 16 10:46:51 openqa systemd[1]: space\x2dslow.mount: Job space\x2dslow.mount/start failed with result 'dependency'.
Jun 16 10:46:51 openqa systemd[1]: Finished File System Check on /dev/disk/by-uuid/40c5b7dd-487f-442b-ba95-b6186b18f3a7.
Jun 16 10:46:51 openqa systemd[1]: Mounting /srv...
Jun 16 10:47:03 openqa systemd[1]: Finished File System Check on /dev/disk/by-uuid/37972261-7411-44eb-bae9-90d1cc1c9a38.
Jun 16 10:47:03 openqa systemd[1]: Mounting /assets...

Unfortunately, the machine was rebooted again before the checks were completed.

The next run was also unsuccessful, this time due to the OOM killer (the "unknown return code" 137 seen below is 128 + 9, i.e. SIGKILL):

Jun 16 10:53:29 openqa kernel: xfs_repair invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Jun 16 10:53:29 openqa kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/system-systemd\x2dfsck.slice,task=xfs_repair,pid=736,uid=0
Jun 16 10:53:29 openqa kernel: Out of memory: Killed process 736 (xfs_repair) total-vm:22753620kB, anon-rss:22335672kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:44040kB oom_score_adj:0
Jun 16 10:53:31 openqa kernel: oom_reaper: reaped process 736 (xfs_repair), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jun 16 10:53:31 openqa systemd-fsck[729]: /sbin/fsck.xfs: line 69:   736 Killed                  xfs_repair -e $DEV
Jun 16 10:53:31 openqa systemd-fsck[729]: /sbin/fsck.xfs error: An unknown return code from xfs_repair '137'
Jun 16 10:53:31 openqa systemd-fsck[715]: fsck failed with exit status 4.
Jun 16 10:53:31 openqa systemd[1]: systemd-fsck@dev-disk-by\x2duuid-b5377fcf\x2d6273\x2d4f38\x2da471\x2dcefff08c60b7.service: Main process exited, code=exited, status=1/FAILURE
Jun 16 10:53:31 openqa systemd[1]: systemd-fsck@dev-disk-by\x2duuid-b5377fcf\x2d6273\x2d4f38\x2da471\x2dcefff08c60b7.service: Failed with result 'exit-code'.
Jun 16 10:53:31 openqa systemd[1]: Failed to start File System Check on /dev/disk/by-uuid/b5377fcf-6273-4f38-a471-cefff08c60b7.
Jun 16 10:53:31 openqa systemd[1]: Dependency failed for /results.
Jun 16 10:53:31 openqa systemd[1]: Dependency failed for Local File Systems.
Jun 16 10:53:31 openqa systemd[1]: local-fs.target: Job local-fs.target/start failed with result 'dependency'.
Jun 16 10:53:31 openqa systemd[1]: local-fs.target: Triggering OnFailure= dependencies.
Jun 16 10:53:31 openqa systemd[1]: Dependency failed for /var/lib/openqa.
Jun 16 10:53:31 openqa systemd[1]: Dependency failed for var-lib-openqa-share.automount.
Jun 16 10:53:31 openqa systemd[1]: var-lib-openqa-share.automount: Job var-lib-openqa-share.automount/start failed with result 'dependency'.
Jun 16 10:53:31 openqa systemd[1]: Dependency failed for var-lib-openqa-share-factory-hdd-fixed.automount.
Jun 16 10:53:31 openqa systemd[1]: var-lib-openqa-share-factory-hdd-fixed.automount: Job var-lib-openqa-share-factory-hdd-fixed.automount/start failed with result 'dependency'.
Jun 16 10:53:31 openqa systemd[1]: Dependency failed for var-lib-openqa-archive.automount.
Jun 16 10:53:31 openqa systemd[1]: var-lib-openqa-archive.automount: Job var-lib-openqa-archive.automount/start failed with result 'dependency'.
Jun 16 10:53:31 openqa systemd[1]: Dependency failed for /var/lib/openqa/share.
Jun 16 10:53:31 openqa systemd[1]: Dependency failed for var-lib-openqa-share-factory-iso-fixed.automount.
Jun 16 10:53:31 openqa systemd[1]: var-lib-openqa-share-factory-iso-fixed.automount: Job var-lib-openqa-share-factory-iso-fixed.automount/start failed with result 'dependency'.
Jun 16 10:53:31 openqa systemd[1]: var-lib-openqa-share.mount: Job var-lib-openqa-share.mount/start failed with result 'dependency'.
Jun 16 10:53:31 openqa systemd[1]: var-lib-openqa.mount: Job var-lib-openqa.mount/start failed with result 'dependency'.
Jun 16 10:53:31 openqa systemd[1]: results.mount: Job results.mount/start failed with result 'dependency'.

The machine was again rebooted before the last running check was able to complete. The same happened on the five subsequent boots:

Jun 16 11:41:34 openqa kernel: Linux version 5.14.21-150500.55.65-default (geeko@buildhost) (gcc (SUSE Linux) 7.5.0, GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.41.0.20230908-150100.7.46) #1 SMP PREEMPT_DYNAMIC Thu May 23 04:57:11 UTC 2024 (a46829d)
Jun 16 11:59:35 openqa kernel: Linux version 5.14.21-150500.55.65-default (geeko@buildhost) (gcc (SUSE Linux) 7.5.0, GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.41.0.20230908-150100.7.46) #1 SMP PREEMPT_DYNAMIC Thu May 23 04:57:11 UTC 2024 (a46829d)
Jun 17 09:00:03 openqa kernel: Linux version 5.14.21-150500.55.65-default (geeko@buildhost) (gcc (SUSE Linux) 7.5.0, GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.41.0.20230908-150100.7.46) #1 SMP PREEMPT_DYNAMIC Thu May 23 04:57:11 UTC 2024 (a46829d)
Jun 17 09:03:15 openqa kernel: Linux version 5.14.21-150500.55.65-default (geeko@buildhost) (gcc (SUSE Linux) 7.5.0, GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.41.0.20230908-150100.7.46) #1 SMP PREEMPT_DYNAMIC Thu May 23 04:57:11 UTC 2024 (a46829d)
Jun 17 09:33:03 openqa kernel: Linux version 5.14.21-150500.55.65-default (geeko@buildhost) (gcc (SUSE Linux) 7.5.0, GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.41.0.20230908-150100.7.46) #1 SMP PREEMPT_DYNAMIC Thu May 23 04:57:11 UTC 2024 (a46829d)

Things changed on the subsequent boot when, for some reason, the main filesystem remained read-only, which introduced more service failures:

Jun 17 09:45:55 openqa kernel: Linux version 5.14.21-150500.55.65-default (geeko@buildhost) (gcc (SUSE Linux) 7.5.0, GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.41.0.20230908-150100.7.46) #1 SMP PREEMPT_DYNAMIC Thu May 23 04:57:11 UTC 2024 (a46829d)
Jun 17 09:45:58 openqa systemd-random-seed[638]: Failed to remove extended attribute, ignoring: Read-only file system
Jun 17 09:46:17 openqa systemd-tmpfiles[1157]: rm_rf(/tmp): Read-only file system
...
Jun 17 09:46:18 openqa (genrules)[1169]: augenrules.service: Failed to set up mount namespacing: /run/systemd/unit-root/dev: Read-only file system
Jun 17 09:46:18 openqa (genrules)[1169]: augenrules.service: Failed at step NAMESPACE spawning /sbin/augenrules: Read-only file system
Jun 17 09:46:18 openqa systemd[1]: Failed to start auditd rules generation.
Jun 17 09:46:18 openqa wickedd[1255]: unable to create unique tempfile in /tmp
Jun 17 09:46:18 openqa wickedd[1255]: __ni_objectmodel_write_message: unable to create tempfile for script arguments
Jun 17 09:46:18 openqa dbus-daemon[1170]: [system] Rejected send message, 0 matched rules; type="error", sender=":1.8" (uid=0 pid=1255 comm="/usr/sbin/wickedd --systemd --foreground ") interface="(unset)" member="(unset)" error name="org.freedesktop.DBus.Error.Failed" requested_rep>
Jun 17 09:46:18 openqa wickedd-nanny[1259]: device eth0: call to org.opensuse.Network.Firewall.firewallUp() failed: Invalid arguments
Jun 17 09:46:18 openqa wickedd-nanny[1259]: eth0: failed to bring up device, still continuing
...
Jun 17 09:46:24 openqa systemd[1]: Failed to start PostgreSQL database server.
Jun 17 09:46:24 openqa systemd[1]: openqa-gru.service: Job openqa-gru.service/start failed with result 'dependency'.
Jun 17 09:46:24 openqa nrpe[1474]: Starting up daemon
Jun 17 09:46:24 openqa systemd[1]: Dependency failed for Setup local PostgreSQL database for openQA.
...
Jun 17 09:46:25 openqa nfsdcltrack[1529]: Unexpected error when checking access on /var/lib/nfs/nfsdcltrack: Read-only file system
Jun 17 09:46:25 openqa kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
...
Jun 17 09:46:27 openqa salt-master[1368]: [ERROR   ] An un-handled exception was caught by Salt's global exception handler:
Jun 17 09:46:27 openqa salt-master[1368]: OSError: [Errno 30] Read-only file system: '/var/cache/salt/master/.root_key'
Jun 17 09:46:27 openqa salt-master[1368]: Traceback (most recent call last):
Jun 17 09:46:27 openqa salt-master[1368]:   File "/usr/bin/salt-master", line 22, in <module>
Jun 17 09:46:27 openqa salt-master[1368]:     salt_master()
...
Jun 17 09:46:27 openqa systemd[1]: salt-master.service: Main process exited, code=exited, status=1/FAILURE
Jun 17 09:46:27 openqa systemd[1]: salt-master.service: Failed with result 'exit-code'.
...
Jun 17 09:52:52 openqa umount[6390]: umount: /var/lib/openqa/share/factory/iso/fixed: not mounted.
Jun 17 09:52:52 openqa systemd[1]: assets-factory-hdd-fixed.mount: Deactivated successfully.
Jun 17 09:52:52 openqa systemd[1]: Unmounted /assets/factory/hdd/fixed.
Jun 17 09:52:52 openqa systemd[1]: assets-factory-iso-fixed.mount: Deactivated successfully.
Jun 17 09:52:52 openqa systemd[1]: Unmounted /assets/factory/iso/fixed.
Jun 17 09:52:52 openqa systemd[1]: var-lib-openqa-share-factory-hdd-fixed.mount: Deactivated successfully.
Jun 17 09:52:52 openqa systemd[1]: Unmounted /var/lib/openqa/share/factory/hdd/fixed.
Jun 17 09:52:52 openqa systemd[1]: var-lib-openqa-share-factory-iso-fixed.mount: Mount process exited, code=exited, status=32/n/a
Jun 17 09:52:52 openqa systemd[1]: var-lib-openqa-share-factory-iso-fixed.mount: Failed with result 'exit-code'.
Jun 17 09:52:52 openqa systemd[1]: Unmounted /var/lib/openqa/share/factory/iso/fixed.
Jun 17 09:52:52 openqa systemd[1]: var-lib-pgsql.mount: Deactivated successfully.
Jun 17 09:52:52 openqa systemd[1]: Unmounted /var/lib/pgsql.
Jun 17 09:52:52 openqa systemd[1]: var-lib-openqa-share-factory-hdd-fixed.automount: Got hangup/error on autofs pipe from kernel. Likely our automount point has been unmounted by someone or something else?
Jun 17 09:52:52 openqa systemd[1]: var-lib-openqa-share-factory-hdd-fixed.automount: Failed with result 'unmounted'.
Jun 17 09:52:52 openqa systemd[1]: Unset automount var-lib-openqa-share-factory-hdd-fixed.automount.

I guess this is also the run where the two problematic mount points were removed from fstab, so openQA started without the proper mounts being present (and created some further problems which later prevented the proper mounting).

If fsck is forced during boot, fsck.xfs just invokes xfs_repair on the device:

if $FORCE; then
        xfs_repair -e $DEV
        repair2fsck_code $?
        exit $?
fi

Normally, the implicit memory limit shouldn't be a problem (see the xfs_repair(8) man page):

   -m maxmem
          Specifies the approximate maximum amount of memory, in megabytes, to use for xfs_repair. xfs_repair has its own internal block cache which will scale out up to the lesser of the process's virtual address limit or about 75% of the system's physical RAM. This option overrides these limits.

          NOTE: These memory limits are only approximate and may use more than the specified limit.
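
So a cap could also be set explicitly per invocation, e.g. (the 4096 MB value is an arbitrary illustration):

# Limit xfs_repair's block cache to ~4 GiB instead of the default 75%
# of physical RAM; with checks running in parallel, the sum of all -m
# values still has to stay below the machine's memory
xfs_repair -m 4096 -e /dev/disk/by-uuid/b5377fcf-6273-4f38-a471-cefff08c60b7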

However, the forced checks during boot can all run in parallel (as documented in the fstab(5) man page):

The sixth field (fs_passno). This field is used by fsck(8) to determine the order in which filesystem checks are done at boot time. The root filesystem should be specified with a fs_passno of 1. Other filesystems should have a fs_passno of 2. Filesystems within a drive will be checked sequentially, but filesystems on different drives will be checked at the same time to utilize parallelism available in the hardware. Defaults to zero (don't check the filesystem) if not present.
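
Reconstructed from the log above, the relevant fstab entries presumably look roughly like this (mount options abbreviated, so purely illustrative); the trailing 2 is what schedules the checks, and nothing serializes them across drives:

UUID=b5377fcf-6273-4f38-a471-cefff08c60b7  /results     xfs  defaults  0  2
UUID=0b735e2d-b300-4a37-b39b-65fdc41d1ad1  /space-slow  xfs  defaults  0  2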

Also, the state in one of the OOM logs suggests there are actually two xfs_repair processes running (and if both take up to 75% of the available memory, things can go wrong).

Jun 16 11:42:29 openqa kernel: Tasks state (memory values in pages):
Jun 16 11:42:29 openqa kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Jun 16 11:42:29 openqa kernel: [    456]     0   456     1978      573    53248       14             0 plymouthd
Jun 16 11:42:29 openqa kernel: [    624]     0   624     2380       48    57344        0         -1000 dmeventd
Jun 16 11:42:29 openqa kernel: [    634]     0   634    22930      646   225280      140          -250 systemd-journal
Jun 16 11:42:29 openqa kernel: [    655]     0   655    15694      702   163840        0         -1000 systemd-udevd
Jun 16 11:42:29 openqa kernel: [    708]     0   708     9550        2   139264      217             0 systemd-fsck
Jun 16 11:42:29 openqa kernel: [    712]     0   712     2120        1    61440       63             0 fsck
Jun 16 11:42:29 openqa kernel: [    717]     0   717     9550        1   143360      217             0 systemd-fsck
Jun 16 11:42:29 openqa kernel: [    727]     0   727     2119      928    53248      142             0 haveged
Jun 16 11:42:29 openqa kernel: [    728]     0   728     2120        1    57344       63             0 fsck
Jun 16 11:42:29 openqa kernel: [    730]     0   730     6369        0   106496      204             0 fsck.xfs
Jun 16 11:42:29 openqa kernel: [    732]     0   732  4638901  4530153 36679680    34715             0 xfs_repair
Jun 16 11:42:29 openqa kernel: [    734]     0   734     6369        0   102400      204             0 fsck.xfs
Jun 16 11:42:29 openqa kernel: [    737]     0   737  3114710  2958462 24522752    88914             0 xfs_repair

The recommendation seems to be to simply not run forced parallel on-boot checks for XFS at all and, when a repair is needed, to do it explicitly from the emergency shell.
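
Run by hand from the emergency shell, the same repairs are trivially serialized, e.g. (a sketch; the UUID list is taken from the log above):

# One xfs_repair at a time, so only a single block cache
# (up to ~75% of RAM) exists at any moment
for uuid in b5377fcf-6273-4f38-a471-cefff08c60b7 \
            0b735e2d-b300-4a37-b39b-65fdc41d1ad1; do
    xfs_repair -e "/dev/disk/by-uuid/$uuid"
done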

Actions #4

Updated by okurz 30 days ago

So is your assumption that someone from IT forced an XFS check for multiple XFS filesystems in parallel by creating /forcefsck files? I consider it more likely that there is some systemd magic creating that file, or that the warning about rather using a command line parameter is misleading. In any case, consider getting in contact with the according persons from IT and ask how they triggered XFS checks, if at all; if they did not, coordinate when to do monitored reboot attempts.

Actions #5

Updated by jbaier_cz 30 days ago

Yeah, it could be misleading (although according to the code it does not look like systemd magic here); nevertheless, I asked that question in SD-159799.

For my part, I ran xfs_repair from the host (I had no access from within the VM at the time).

Could that flag the filesystems for another round of fsck during the next boot?

Actions #6

Updated by jbaier_cz 27 days ago

Also, the state in one of the OOM logs suggests there are actually two xfs_repair processes running (and if both take up to 75% of the available memory, things can go wrong).

I guess #162332#note-19 proves this hypothesis. The question remains why the filesystem checks are running on every boot. Could it be that they did not run successfully and the FS is marked as dirty? Is the man page about running fsck just wrong?

Actions #7

Updated by okurz 27 days ago

jbaier_cz wrote in #note-6:

The question remains why the filesystem checks are running on every boot. Could it be that they did not run successfully and the FS is marked as dirty?

Yes, I am pretty sure that is what happened. There could be multiple reasons why a filesystem is marked as dirty and needing checks, like an unclean shutdown. I would not be concerned about why that happened, but rather expect such situations to happen again in the future. I guess what happens is this: at least two XFS filesystems trigger filesystem checks because of the "2" at the end of their lines in /etc/fstab, and all of them run in parallel. Maybe it works, though undocumented, to use individual, different numbers to serialize the operations. Also, I wonder about the "1" for /home, whether that shouldn't be a 2 or a different number as well.

You could check in a simulated environment, maybe with something like

qemu-system-x86_64 -m 8192 -snapshot -hda /dev/vda -hdb /dev/vdb -hdc /dev/vdc -hdd /dev/vdd -drive file=/dev/vde,format=raw -nographic -serial mon:stdio -smp 4

or try to patch the xfs_repair calls to limit memory, or disable the boot-up filesystem checks.
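
Disabling the boot-up checks would amount to zeroing the sixth fstab field for the affected entries, e.g. (the sed expression is a sketch and assumes whitespace-separated fields):

# Set fs_passno to 0 on every xfs line so systemd never schedules a
# systemd-fsck@ instance for those filesystems
sed -ri '/[[:space:]]xfs[[:space:]]/ s/([0-9]+)[[:space:]]+[0-9]+[[:space:]]*$/\1 0/' /etc/fstab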

Actions #8

Updated by okurz 27 days ago

  • Copied to action #162590: NFS mounts are stuck on OSD workers if partitions on OSD fail to come up properly on boot size:S added
Actions #9

Updated by jbaier_cz 27 days ago

Did some additional research about fs_passno for numbers > 2. It is generally allowed to use higher numbers; however, systemd only uses fs_passno to decide whether a filesystem should be checked at all (zero versus non-zero) and does not serialize the checks by pass number, so higher numbers would not help here.

This makes patching /sbin/fsck.xfs look like the easiest solution. We just need to do it in some not-so-hacky way.
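
One possibility in that direction (purely a sketch, untested): wrap the xfs_repair call in the FORCE branch quoted in #note-3 so that concurrent boot-time repairs take turns on a lock file and get an explicit memory cap:

# Hypothetical patch to the FORCE branch of /sbin/fsck.xfs: flock(1)
# serializes parallel systemd-fsck@ instances, and -m bounds each
# instance's cache (the 4096 MB value is illustrative)
if $FORCE; then
        flock /run/xfs_repair.lock xfs_repair -m 4096 -e $DEV
        repair2fsck_code $?
        exit $?
fi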

Actions #10

Updated by livdywan 27 days ago

  • Subject changed from OSD can fail on xfs_repair OOM conditions to OSD can fail on xfs_repair OOM conditions size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #11

Updated by jbaier_cz 26 days ago

I did some more research and stumbled upon a rather old Debian bug, where a similar problem was discussed (in that case it was parallel fsck on multiple LVM volumes backed by the same disks). It looks like this problem is known and there is no real plan to solve it. However, there are two potential ideas:

  • comment 81 suggests using fsck -A instead of the systemd services (see the sketch after this list)

    Another normal (or maybe important) bug for systemd to not launch
    fsck's for LVM or MD devices in parallel, wait for all other running
    fscks to finish, then launch one and only one fsck for LVM or MD, etc.;
    thats what fsck -A does, if systemd cannot do it better then just use
    fsck -A that works correctly, don't reinvent an inferior wheel!

  • comment 59 suggests ignoring the issue or disabling the systemd services

    My recommendation is to ignore this issue, or if you really see any
    performance problem than disable fsck by systemd and use by hands
    written script to call fsck.
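
For completeness, the first idea in its simplest form (a sketch): fsck -A walks /etc/fstab itself and honors the fs_passno ordering, so checks with different pass numbers run sequentially:

# Check all filesystems listed in /etc/fstab in fs_passno order;
# -T suppresses the title line, -R skips the already-mounted root
fsck -A -T -R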

Actions #12

Updated by jbaier_cz 21 days ago

In accordance with #162362#note-11, this MR should disable the parallel checks.

Actions #13

Updated by jbaier_cz 21 days ago

  • Status changed from Workable to Resolved

Merged and deployed.
