action #162356
closed
Treat OSD non-root mounts as non-critical for boot size:M
Added by okurz 6 months ago.
Updated 6 months ago.
Category: Feature requests
Description
Motivation
#162332-7 OSD was encountered as unbootable, possibly due to filesystem problems in the big volumes like vdb, vdc, vdd, vde. We should treat those mount points as non-critical for boot but make critical services like openQA depend on them, so that openQA only starts up with a consistent set of mount points. If possible, still trigger the automatic filesystem checks, but make them non-critical for bringing the system up reachable over the network.
Acceptance criteria
- AC1: OSD is reachable over SSH after reboot regardless of problems with vd[a-e]
- AC2: openQA services are prevented from starting if not all relevant mount points are available
Suggestions
- DONE Add "nofail" to all non-root partitions in /etc/fstab (see the fstab sketch after this list)
- DONE Add RequiresMountsFor to all relevant high-level services, e.g. openQA, to prevent openQA from starting and writing incomplete data (see the drop-in sketch after this list)
- Ensure that a system like that can boot with all partitions in good state
- Ensure that a system like that can boot, at least reachable over ssh, with at least one partition in a non-good or absent state, e.g. simulate on OSD directly with the following (-snapshot keeps all writes in a temporary overlay, so the real disk stays untouched):
qemu-system-x86_64 -m 8192 -snapshot -hda /dev/vda -nographic -serial mon:stdio -smp 4
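A minimal sketch of what such nofail entries could look like; the device names, mount points and filesystem types are illustrative assumptions, not copied from OSD's actual /etc/fstab:
/dev/vda2  /                ext4  defaults         0 1
/dev/vdb   /var/lib/openqa  xfs   defaults,nofail  0 2
/dev/vdc   /srv             xfs   defaults,nofail  0 2
With nofail, systemd no longer treats the mount as required for local-fs.target, so a missing or broken disk degrades the boot instead of dropping into emergency mode, while the non-zero pass number still lets fsck run whenever the device is present.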
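And a sketch of the matching service-side dependency, e.g. as a drop-in /etc/systemd/system/openqa-webui.service.d/mounts.conf (the drop-in path and the exact list of mount points are assumptions):
[Unit]
# Require the data mounts and order the service after them; if a
# mount is absent or failed, the start job is cancelled instead of
# the service running against an incomplete filesystem tree.
RequiresMountsFor=/var/lib/openqa /srv
RequiresMountsFor= implicitly adds Requires= and After= on the mount units of the given paths, which is exactly the "only start with a consistent set of mount points" behaviour from the motivation.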
- Copied from action #162353: Ensure consistent known root password on all OSD webUI+workers size:S added
- Copied to action #162359: Change OSD root to more modern filesystem mount options size:S added
- Assignee set to nicksinger
- Priority changed from Normal to High
- Target version changed from Tools - Next to Ready
- Due date set to 2024-07-01
- Status changed from New to Feedback
- Status changed from Feedback to In Progress
I haven't conducted a full test matrix with different disks missing, but I can at least confirm that my changes bring up the machine if nothing besides the root disk is present. The biggest remaining problem seems to be auditd.service, which tries to start 30 times and always fails at this step:
Jun 17 19:53:37 openqa systemd[1]: Starting Security Auditing Service...
Jun 17 19:53:38 openqa auditd[622]: Could not open dir /var/log/audit (No such file or directory)
Jun 17 19:53:38 openqa auditd[622]: The audit daemon is exiting.
Jun 17 19:53:38 openqa systemd[1]: auditd.service: Control process exited, code=exited, status=6/NOTCONFIGURED
Jun 17 19:53:39 openqa auditctl[623]: enabled 0
Jun 17 19:53:39 openqa auditctl[623]: failure 1
Jun 17 19:53:39 openqa auditctl[623]: pid 0
Jun 17 19:53:39 openqa auditctl[623]: rate_limit 0
Jun 17 19:53:39 openqa auditctl[623]: backlog_limit 64
Jun 17 19:53:39 openqa auditctl[623]: lost 0
Jun 17 19:53:39 openqa auditctl[623]: backlog 0
Jun 17 19:53:39 openqa auditctl[623]: backlog_wait_time 15000
Jun 17 19:53:39 openqa auditctl[623]: backlog_wait_time_actual 0
Jun 17 19:53:39 openqa auditctl[623]: No rules
Jun 17 19:53:39 openqa systemd[1]: auditd.service: Failed with result 'exit-code'.
Jun 17 19:53:39 openqa systemd[1]: Failed to start Security Auditing Service.
Jun 17 19:53:39 openqa systemd[1]: auditd.service: Scheduled restart job, restart counter is at 6.
Jun 17 19:53:39 openqa systemd[1]: Stopped Security Auditing Service.
which seems to happen because our /var/log is a symlink pointing into /srv/log/. I have to think about a solution, or maybe we just keep it as is, because eventually auditd gives up and the system boot continues until login (which is what we actually want to achieve).
nicksinger wrote in #note-5:
which seems to happen because our /var/log is a symlink pointing into /srv/log/. I have to think about a solution, or maybe we just keep it as is, because eventually auditd gives up and the system boot continues until login (which is what we actually want to achieve).
Maybe adding a new systemd dependency on /srv/log being mounted for auditd would help?
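For illustration, that could be a drop-in like /etc/systemd/system/auditd.service.d/log-mount.conf (a sketch only; whether the MR mentioned below does it this way is an assumption):
[Unit]
# /var/log is a symlink into /srv/log, so auditd can only open
# /var/log/audit once the mount backing /srv/log is in place.
RequiresMountsFor=/srv/log
With that in place auditd should be skipped with a dependency failure instead of restarting 30 times when the mount is missing.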
- Status changed from In Progress to Workable
- Assignee deleted (nicksinger)
- Priority changed from High to Normal
I've created an MR to adjust auditd: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1212
I think we now have the main parts in place to lower the prio, as we can now safely restart OSD even if a disk fails.
However, we still need to adjust how systemd behaves when this happens. Currently the system starts up fine but only puts some services (nginx, postgres) into their failed state (visible via systemctl --failed). This is confusing and not really helpful for the uninitiated person debugging it. systemd clearly realizes that something went wrong: openqa-webui.service: Job openqa-webui.service/start failed with result 'dependency'. but the service itself is put into inactive (dead). I failed to understand why this happens and how this behavior can be changed (without blocking the complete boot chain again).
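As far as I can tell this is regular systemd semantics rather than a bug: when a Requires= dependency fails, the queued start job is cancelled with the job result 'dependency', but the unit itself never ran, so it has nothing to record and stays inactive (dead). A quick way to see all the pieces with plain systemd tooling (nothing OSD-specific assumed):
# units that really entered the failed state (the mounts, auditd, ...)
systemctl --failed
# the skipped service only shows ActiveState=inactive, SubState=dead
systemctl show -p ActiveState,SubState openqa-webui.service
# the 'dependency' job result is only recorded in the journal
journalctl -b -u openqa-webui.service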
- Status changed from Workable to New
- Subject changed from Treat OSD non-root mounts as non-critical for boot to Treat OSD non-root mounts as non-critical for boot size:M
- Description updated (diff)
- Status changed from New to Workable
- Due date deleted (2024-07-01)
- Start date deleted (2024-06-17)
- Status changed from Workable to Resolved
- Assignee set to nicksinger
So both ACs are fulfilled, as far as we can verify without artificially impacting OSD itself. I verified with virtual machines and could see that the relevant services are not started. I think the current behaviour of services not being started and ending up "inactive"/"dead" is the best we can get; that should suffice. Resolving and assigning back to nicksinger, who did most of the relevant work.