action #162356

open

Treat OSD non-root mounts as non-critical for boot size:M

Added by okurz 13 days ago. Updated 10 days ago.

Status:
Workable
Priority:
Normal
Assignee:
-
Category:
Feature requests
Target version:
Start date:
2024-06-17
Due date:
2024-07-01 (Due in 1 day)
% Done:

0%

Estimated time:
Tags:

Description

Motivation

In #162332-7, OSD was found to be unbootable, possibly due to filesystem problems on the big volumes such as vdb, vdc, vdd and vde. We should treat those mount points as non-critical for boot but make critical services like openQA depend on them, so that openQA only starts up with a consistent set of mount points. If possible, still trigger the automatic filesystem checks but make them non-critical for making the system reachable over the network.

Acceptance criteria

  • AC1: OSD is reachable over SSH after reboot regardless of problems with vd[a-e]
  • AC2: openQA services are prevented from starting if not all relevant mount points are available

Suggestions

  • DONE Add "nofail" to all non-root partitions in /etc/fstab
  • DONE Add RequiresMountsFor for all relevant high-level services, e.g. openQA, to prevent openQA starting and writing incomplete data
  • Ensure that a system like that can boot with all partitions in good state
  • Ensure that such a system still boots far enough to be reachable over SSH with at least one partition in a bad or absent state, e.g. simulate directly on OSD with qemu-system-x86_64 -m 8192 -snapshot -hda /dev/vda -nographic -serial mon:stdio -smp 4
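
A minimal sketch for checking the "nofail" suggestion above, assuming a standard six-column fstab layout. The function name fstab_missing_nofail is hypothetical and not part of any existing tooling; it only reads the file it is given and changes nothing.

```shell
#!/bin/sh
# Print the mount point of every non-root, non-swap fstab entry
# whose option list does not contain "nofail".
# Usage: fstab_missing_nofail /etc/fstab
fstab_missing_nofail() {
  awk '
    /^[[:space:]]*#/ || NF < 4 { next }   # skip comments, blank and short lines
    $2 == "/" || $3 == "swap"  { next }   # root stays critical; swap is not a mount point
    {
      n = split($4, opts, ",")
      found = 0
      for (i = 1; i <= n; i++) if (opts[i] == "nofail") found = 1
      if (!found) print $2                # this entry can still block the boot
    }
  ' "$1"
}
```

Running fstab_missing_nofail /etc/fstab on OSD should print nothing once every non-root partition carries "nofail"; any mount point it prints is an entry that still needs the option.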

Related issues 2 (2 open, 0 closed)

Copied from openQA Infrastructure - action #162353: Ensure consistent known root password on all OSD webUI+workers (New, 2024-06-17)

Copied to openQA Infrastructure - action #162359: Change OSD root to more modern filesystem mount options (New, 2024-06-17)

Actions #1

Updated by okurz 13 days ago

  • Copied from action #162353: Ensure consistent known root password on all OSD webUI+workers added
Actions #2

Updated by okurz 13 days ago

  • Copied to action #162359: Change OSD root to more modern filesystem mount options added
Actions #3

Updated by okurz 13 days ago

  • Assignee set to nicksinger
  • Priority changed from Normal to High
  • Target version changed from Tools - Next to Ready

I assume @nicksinger is working on that.

Actions #4

Updated by okurz 13 days ago

  • Due date set to 2024-07-01
  • Status changed from New to Feedback

This is how we tested on OSD how the system behaves when e.g. the additional disks are completely absent:

qemu-system-x86_64 -m 8192 -snapshot -hda /dev/vda -nographic -serial mon:stdio -smp 4

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1207 is the corresponding MR by nicksinger, merged.

Actions #5

Updated by nicksinger 13 days ago

  • Status changed from Feedback to In Progress

I haven't run a full test matrix with different disks missing, but I can at least confirm that my changes bring up the machine when nothing besides the root disk is present. The biggest remaining problem seems to be auditd.service, which tries to start 30 times and always fails at this step:

Jun 17 19:53:37 openqa systemd[1]: Starting Security Auditing Service...
Jun 17 19:53:38 openqa auditd[622]: Could not open dir /var/log/audit (No such file or directory)
Jun 17 19:53:38 openqa auditd[622]: The audit daemon is exiting.
Jun 17 19:53:38 openqa systemd[1]: auditd.service: Control process exited, code=exited, status=6/NOTCONFIGURED
Jun 17 19:53:39 openqa auditctl[623]: enabled 0
Jun 17 19:53:39 openqa auditctl[623]: failure 1
Jun 17 19:53:39 openqa auditctl[623]: pid 0
Jun 17 19:53:39 openqa auditctl[623]: rate_limit 0
Jun 17 19:53:39 openqa auditctl[623]: backlog_limit 64
Jun 17 19:53:39 openqa auditctl[623]: lost 0
Jun 17 19:53:39 openqa auditctl[623]: backlog 0
Jun 17 19:53:39 openqa auditctl[623]: backlog_wait_time 15000
Jun 17 19:53:39 openqa auditctl[623]: backlog_wait_time_actual 0
Jun 17 19:53:39 openqa auditctl[623]: No rules
Jun 17 19:53:39 openqa systemd[1]: auditd.service: Failed with result 'exit-code'.
Jun 17 19:53:39 openqa systemd[1]: Failed to start Security Auditing Service.
Jun 17 19:53:39 openqa systemd[1]: auditd.service: Scheduled restart job, restart counter is at 6.
Jun 17 19:53:39 openqa systemd[1]: Stopped Security Auditing Service.

which seems to happen because our /var/log is a symlink pointing into /srv/log/. I have to think about a solution, or maybe we just keep it as is, because eventually auditd aborts and the system boot continues until login (which is what we actually want to achieve).

Actions #6

Updated by jbaier_cz 13 days ago

nicksinger wrote in #note-5:

which seems to happen because our /var/log is a symlink pointing into /srv/log/. I have to think about a solution, or maybe we just keep it as is, because eventually auditd aborts and the system boot continues until login (which is what we actually want to achieve).

Maybe adding a new systemd dependency on /srv/log being mounted for auditd would help?
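
A minimal sketch of what such a dependency could look like, written as a systemd drop-in. This is an assumption about one possible approach, not necessarily what was merged later. The helper write_mount_dropin, the file name 10-requires-mounts.conf and the staging directory are hypothetical; the drop-in names /srv/log rather than /var/log on the assumption that systemd takes the path literally and does not follow the symlink.

```shell
#!/bin/sh
# write_mount_dropin ROOT UNIT PATH...
# Writes ROOT/etc/systemd/system/UNIT.d/10-requires-mounts.conf so that UNIT
# requires, and is ordered after, the mount units needed to reach each PATH.
write_mount_dropin() {
  root=$1; unit=$2; shift 2
  dir="$root/etc/systemd/system/$unit.d"
  mkdir -p "$dir"
  {
    echo "[Unit]"
    echo "RequiresMountsFor=$*"
  } > "$dir/10-requires-mounts.conf"
}

# Stage the drop-in in a scratch directory instead of touching /etc directly.
staging=$(mktemp -d)
write_mount_dropin "$staging" auditd.service /srv/log
cat "$staging/etc/systemd/system/auditd.service.d/10-requires-mounts.conf"
```

Once such a file is placed under the real /etc/systemd/system, systemctl daemon-reload is needed for it to take effect, and systemctl cat auditd.service shows whether the drop-in was picked up. The same helper could be pointed at other units, e.g. openqa-webui.service, with their own list of paths.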

Actions #7

Updated by nicksinger 12 days ago

  • Status changed from In Progress to Workable
  • Assignee deleted (nicksinger)
  • Priority changed from High to Normal

I've created an MR to adjust auditd https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1212
I think we now have the main parts in place to lower the prio as we can now safely restart OSD if any disk fails.

However, we still need to adjust how systemd behaves if this happens. Currently the system starts up fine but only puts some services (nginx, postgres) into the failed state (systemctl --failed). This is confusing and not really helpful for someone unfamiliar with the setup who is debugging this. systemd clearly notices that something went wrong: "openqa-webui.service: Job openqa-webui.service/start failed with result 'dependency'." but the service ends up as inactive (dead) instead of failed. I could not figure out why this happens or how this behavior can be changed (without blocking the complete boot chain again).
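
Until that is solved, a minimal sketch for making these units visible: since systemctl --failed does not list them, the journal message quoted above can be used to find every unit whose start job was dropped because of a dependency. The function name dependency_failed_units is hypothetical, and the pattern assumes the message format stays exactly as quoted.

```shell
#!/bin/sh
# Read journal lines on stdin and print the name of every unit whose start job
# ended with result 'dependency', one per line, without duplicates.
dependency_failed_units() {
  sed -n "s|.*: Job \(.*\)/start failed with result 'dependency'.*|\1|p" | sort -u
}
```

On a booted system this could be used as journalctl -b | dependency_failed_units, next to systemctl --failed, to get the full picture of what did not come up.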

Actions #8

Updated by okurz 10 days ago

  • Status changed from Workable to New
Actions #9

Updated by livdywan 10 days ago

  • Subject changed from Treat OSD non-root mounts as non-critical for boot to Treat OSD non-root mounts as non-critical for boot size:M
  • Description updated (diff)
  • Status changed from New to Workable