action #162356

closed

Treat OSD non-root mounts as non-critical for boot size:M

Added by okurz 6 months ago. Updated 6 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Start date:
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

In #162332-7, OSD was found to be unbootable, possibly due to filesystem problems on the big volumes such as vdb, vdc, vdd and vde. We should treat those mount points as non-critical for boot but make critical services like openQA depend on them, so that openQA only starts up with a consistent set of mount points. If possible, still trigger the automatic filesystem checks, but make them non-critical for making the system reachable over the network.

Acceptance criteria

  • AC1: OSD is reachable over SSH after reboot regardless of problems with vd[a-e]
  • AC2: openQA services are prevented from starting if not all relevant mount points are available

Suggestions

  • DONE Add "nofail" to all non-root partitions in /etc/fstab
  • DONE Add RequiresMountsFor for all relevant high-level services, e.g. openQA, to prevent openQA starting and writing incomplete data
  • Ensure that a system like that can boot with all partitions in good state
  • Ensure that a system like that can boot far enough to be reachable over SSH with at least one partition in a bad or absent state, e.g. simulate on OSD directly with qemu-system-x86_64 -m 8192 -snapshot -hda /dev/vda -nographic -serial mon:stdio -smp 4
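
The first suggestion could be sketched as a small shell filter. This is only an illustration, not the change from the merged MR: the function name add_nofail is hypothetical, and it assumes a whitespace-separated fstab whose second field is the mount point and whose fourth field holds the mount options.

```shell
# Read an fstab on stdin and append "nofail" to the options of every
# entry except the root filesystem. Hypothetical helper for illustration.
add_nofail() {
  awk '
    $1 ~ /^#/ || NF < 4 { print; next }              # keep comments and short lines as-is
    $2 == "/" { print; next }                        # root stays critical for boot
    $4 !~ /(^|,)nofail(,|$)/ { $4 = $4 ",nofail" }   # append the option only once
    { print }
  '
}
```

Running add_nofail < /etc/fstab prints the adjusted table without touching the file, so the result can be reviewed first. Note that awk rewrites the separators of changed lines to single spaces.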

Related issues 2 (0 open, 2 closed)

Copied from openQA Infrastructure (public) - action #162353: Ensure consistent known root password on all OSD webUI+workers size:S (Resolved, nicksinger, 2024-06-17)

Copied to openQA Infrastructure (public) - action #162359: Change OSD root to more modern filesystem mount options size:S (Resolved, robert.richardson, 2024-06-17)

Actions #1

Updated by okurz 6 months ago

  • Copied from action #162353: Ensure consistent known root password on all OSD webUI+workers size:S added
Actions #2

Updated by okurz 6 months ago

  • Copied to action #162359: Change OSD root to more modern filesystem mount options size:S added
Actions #3

Updated by okurz 6 months ago

  • Assignee set to nicksinger
  • Priority changed from Normal to High
  • Target version changed from Tools - Next to Ready

I assume @nicksinger is working on that.

Actions #4

Updated by okurz 6 months ago

  • Due date set to 2024-07-01
  • Status changed from New to Feedback

This is how we tested on OSD how the system behaves if, e.g., the additional disks are completely absent:

qemu-system-x86_64 -m 8192 -snapshot -hda /dev/vda -nographic -serial mon:stdio -smp 4

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1207 is the corresponding MR by nicksinger, merged.

Actions #5

Updated by nicksinger 6 months ago

  • Status changed from Feedback to In Progress

I haven't run a full test matrix with different disks missing, but I can at least confirm that my changes bring up the machine if nothing besides the root disk is present. The biggest remaining problem seems to be auditd.service, which tries to start 30 times and always fails at this step:

Jun 17 19:53:37 openqa systemd[1]: Starting Security Auditing Service...
Jun 17 19:53:38 openqa auditd[622]: Could not open dir /var/log/audit (No such file or directory)
Jun 17 19:53:38 openqa auditd[622]: The audit daemon is exiting.
Jun 17 19:53:38 openqa systemd[1]: auditd.service: Control process exited, code=exited, status=6/NOTCONFIGURED
Jun 17 19:53:39 openqa auditctl[623]: enabled 0
Jun 17 19:53:39 openqa auditctl[623]: failure 1
Jun 17 19:53:39 openqa auditctl[623]: pid 0
Jun 17 19:53:39 openqa auditctl[623]: rate_limit 0
Jun 17 19:53:39 openqa auditctl[623]: backlog_limit 64
Jun 17 19:53:39 openqa auditctl[623]: lost 0
Jun 17 19:53:39 openqa auditctl[623]: backlog 0
Jun 17 19:53:39 openqa auditctl[623]: backlog_wait_time 15000
Jun 17 19:53:39 openqa auditctl[623]: backlog_wait_time_actual 0
Jun 17 19:53:39 openqa auditctl[623]: No rules
Jun 17 19:53:39 openqa systemd[1]: auditd.service: Failed with result 'exit-code'.
Jun 17 19:53:39 openqa systemd[1]: Failed to start Security Auditing Service.
Jun 17 19:53:39 openqa systemd[1]: auditd.service: Scheduled restart job, restart counter is at 6.
Jun 17 19:53:39 openqa systemd[1]: Stopped Security Auditing Service.

This seems to happen because our /var/log is a symlink pointing into /srv/log/. I have to think about a solution, or maybe we just keep it as it is, because auditd eventually aborts and the system boot continues until login (which is what we actually want to achieve).

Actions #6

Updated by jbaier_cz 6 months ago

nicksinger wrote in #note-5:

which seems to happen because our /var/log is a symlink pointing into /srv/log/. I have to think about a solution or maybe we just keep it because eventually auditd aborts and the system boot continues until login (what we actually want to achieve).

Maybe adding a new systemd dependency on /srv/log being mounted for auditd would help?
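
Such a dependency could be expressed as a systemd drop-in using RequiresMountsFor. The following is a minimal sketch under assumptions, not the content of any MR in this ticket: the function name write_mount_dropin and the file name requires-mounts.conf are hypothetical.

```shell
# Write a drop-in that makes UNIT require (and start after) the mounts
# needed to access MOUNT_PATH. Hypothetical helper for illustration.
# The third argument defaults to /etc/systemd/system; override it for a dry run.
write_mount_dropin() {
  unit=$1
  mount_path=$2
  dropin_root=${3:-/etc/systemd/system}
  mkdir -p "$dropin_root/$unit.d" || return 1
  printf '[Unit]\nRequiresMountsFor=%s\n' "$mount_path" \
    > "$dropin_root/$unit.d/requires-mounts.conf"
}
```

For the case discussed here, the call would be write_mount_dropin auditd.service /srv/log, followed by systemctl daemon-reload so that systemd picks up the new drop-in.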

Actions #7

Updated by nicksinger 6 months ago

  • Status changed from In Progress to Workable
  • Assignee deleted (nicksinger)
  • Priority changed from High to Normal

I've created an MR to adjust auditd: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1212
I think we now have the main parts in place and can lower the prio, as we can now safely restart OSD if any disk fails.

However, we still need to adjust how systemd behaves if this happens. Currently the system starts up fine but only puts some services (nginx, postgres) into their failed state (systemctl --failed). This is confusing and not really helpful for the uninitiated person debugging this. Systemd clearly realizes that something went wrong: openqa-webui.service: Job openqa-webui.service/start failed with result 'dependency'. but the service is put into inactive (dead). I failed to understand why this happens and how this behavior can be changed (without blocking the complete bootchain again).
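
Since such units do not show up in systemctl --failed, one way to find them is to search the journal for the message quoted above. This is a rough sketch that assumes the message format stays as shown; the function name dependency_failed_units is hypothetical.

```shell
# Read journal lines on stdin and print the names of units whose start job
# failed because of a dependency. Hypothetical helper for illustration.
dependency_failed_units() {
  sed -n "s|.*: Job \([^ ]*\)/start failed with result 'dependency'\..*|\1|p" | sort -u
}
```

On a booted system this could be used as journalctl -b | dependency_failed_units to list the services that were skipped in the current boot.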

Actions #8

Updated by okurz 6 months ago

  • Status changed from Workable to New
Actions #9

Updated by livdywan 6 months ago

  • Subject changed from Treat OSD non-root mounts as non-critical for boot to Treat OSD non-root mounts as non-critical for boot size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #10

Updated by livdywan 6 months ago

  • Due date deleted (2024-07-01)
  • Start date deleted (2024-06-17)
Actions #11

Updated by okurz 6 months ago

  • Status changed from Workable to Resolved
  • Assignee set to nicksinger

So both ACs are fulfilled as far as we can verify without artificially impacting OSD itself. I verified with virtual machines and could see that the relevant services are not started. I think the current behaviour of services not being started and ending up "inactive" or "dead" is the best we can get. That should suffice. Resolving and assigning back to nicksinger, who did most of the relevant work.
