action #162590

closed

NFS mounts are stuck on OSD workers if partitions on OSD fail to come up properly on boot size:S

Added by okurz 27 days ago. Updated 22 days ago.

Status: Resolved
Priority: High
Assignee:
Category: Feature requests
Target version:
Start date: 2024-06-17
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

As observed in #162365, mounts on OSD do not always come up properly on boot, and we also no longer want to treat them as critical for boot. What likely happens is that nfs-server.service on OSD starts on boot even if /var/lib/openqa/share is not mounted yet. Clients, e.g. worker40 and others, then connect and are served. Afterwards /var/lib/openqa/share is mounted on OSD over the already existing directory, causing clients to misbehave, as reported by acarvajal in https://suse.slack.com/archives/C02CANHLANP/p1718811455420459. We can probably ensure that nfs-server only starts after /var/lib/openqa/share is completely available, either by adding explicit systemd unit requirements or by providing the underlying mount points instead of bind mount directories.

Acceptance criteria

  • AC1: ls /var/lib/openqa/share/ lists content on OSD workers using that directory from NFS exports after the OSD server reboots
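
A minimal sketch of how AC1 could be checked on a worker. The function name share_has_content and the 10 second limit are assumptions made for illustration and are not part of the ticket; timeout (from coreutils) is used because ls can hang on a stuck NFS mount.

```shell
# share_has_content DIR
# Succeeds only if DIR is a directory and lists at least one entry.
# The 10 second limit is an arbitrary, hypothetical choice so that a
# stuck NFS mount makes the check fail instead of hanging forever.
share_has_content() {
    dir=$1
    [ -d "$dir" ] || return 1
    # ls -A prints nothing for an empty directory, so the test fails then
    [ -n "$(timeout 10 ls -A "$dir" 2>/dev/null)" ]
}
```

On a worker this might be called as share_has_content /var/lib/openqa/share && echo "share ok". An empty listing does not prove the mount is missing, but it matches the symptom described in this ticket.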

Suggestions

  • Verify in production with a planned and monitored OSD reboot (after the corresponding sibling ticket about xfs_repair OOM), or try to reproduce the problem on the server side on OSD with qemu-system-x86_64 -m 8192 -snapshot -drive file=/dev/vda,if=virtio -drive file=/dev/vdb,if=virtio -drive file=/dev/vdc,if=virtio -drive file=/dev/vdd,if=virtio -drive file=/dev/vde,if=virtio -nographic -serial mon:stdio -smp 4 and then try to access the NFS mount from within that VM. Alternatively, try to reproduce the problem in plain VMs
  • Research upstream about NFS server and systemd units and dependencies on mount points
  • See if we can ensure services start after mount points are accessible, specifically that /var/lib/openqa/share is mounted before nfs-server is started, likely with a systemd override file adding a "RequiresMountsFor", like https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1207/diffs#fb557f9ca291facc4d54992e48f7126c56c74208_442_448
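
The override idea from the last suggestion could look roughly like the sketch below. It is not the content of the linked merge request; the function name write_nfs_dropin and the file name requires-share.conf are hypothetical. The systemd directive is spelled RequiresMountsFor= and belongs in the [Unit] section; it adds Requires= and After= dependencies on the mount units needed to reach the given path.

```shell
# write_nfs_dropin DIR
# Writes a systemd drop-in into DIR that makes the unit wait for the
# share mount. The file name requires-share.conf is a hypothetical choice.
write_nfs_dropin() {
    dropin_dir=$1
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/requires-share.conf" <<'EOF'
[Unit]
RequiresMountsFor=/var/lib/openqa/share
EOF
}
```

On the OSD server this might be used as write_nfs_dropin /etc/systemd/system/nfs-server.service.d followed by systemctl daemon-reload. With such a dependency nfs-server.service would not start if the mount fails, which seems preferable to exporting the empty underlying directory.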

Rollback steps


Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure - action #163097: Share mount not working on openqaworker-arm-1 and other workers size:M (Resolved, mkittler, 2024-07-17)

Copied from openQA Infrastructure - action #162365: OSD can fail on xfs_repair OOM conditions size:S (Resolved, jbaier_cz, 2024-06-17)