action #162356: Treat OSD non-root mounts as non-critical for boot size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

action #162356

closed

Treat OSD non-root mounts as non-critical for boot size:M

Added by okurz 6 months ago. Updated 6 months ago.

Status: Resolved
Priority: Normal
Assignee: nicksinger
Category: Feature requests
Target version: openQA Project (public) - Ready
Start date:
Due date:
% Done:
Estimated time:
Tags: infra

Description

Motivation

In #162332-7 OSD was found to be unbootable, possibly due to filesystem problems on the big volumes such as vdb, vdc, vdd and vde. We should treat those mount points as non-critical for boot, but make critical services like openQA depend on them so that openQA only starts up with a consistent set of mount points. If possible, still trigger the automatic filesystem checks, but make them non-critical for making the system reachable over the network.

Acceptance criteria

AC1: OSD is reachable over SSH after reboot regardless of problems with vd[a-e]
AC2: openQA services are prevented from starting if not all relevant mount points are available

Suggestions

DONE Add "nofail" to all non-root partitions in /etc/fstab
DONE Add RequiresMountsFor for all relevant high-level services, e.g. openQA, to prevent openQA starting and writing incomplete data
Ensure that such a system can boot with all partitions in a good state
Ensure that such a system still boots far enough to be reachable over SSH when at least one partition is in a bad state or absent, e.g. simulate on OSD directly with qemu-system-x86_64 -m 8192 -snapshot -hda /dev/vda -nographic -serial mon:stdio -smp 4
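
The "nofail" suggestion above could be sketched roughly as follows. This is only an illustration, not the change that was actually deployed: the helper name add_nofail and the device-to-mount-point pairs in the example are assumptions. The function filters fstab-formatted text from stdin to stdout, so it does not touch the real /etc/fstab on its own.

```shell
#!/bin/sh
# Append "nofail" to the options field of every non-root fstab entry.
# Reads fstab-formatted text on stdin and writes the result to stdout.
# Note: modified lines are re-joined with single spaces, which is still
# valid fstab syntax but does not preserve column alignment.
add_nofail() {
    awk '
        # Pass comments and blank lines through unchanged.
        /^[ \t]*(#|$)/ { print; next }
        # Leave incomplete lines, the root filesystem, and entries
        # that already carry nofail untouched.
        NF < 4 || $2 == "/" || $4 ~ /(^|,)nofail(,|$)/ { print; next }
        # Field 4 holds the mount options.
        { $4 = $4 ",nofail"; print }
    '
}

# Example with made-up entries (the /dev/vdb -> /srv mapping is hypothetical):
printf '%s\n' \
    '/dev/vda1 / ext4 defaults 0 1' \
    '/dev/vdb /srv xfs defaults 0 2' | add_nofail
# prints:
# /dev/vda1 / ext4 defaults 0 1
# /dev/vdb /srv xfs defaults,nofail 0 2
```

With "nofail" set, systemd does not treat a failure to mount that entry as fatal for boot, which is why services that need the data should additionally declare their mount dependencies themselves.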

Related issues 2 (0 open — 2 closed)

Updated by okurz 6 months ago

Copied from action #162353: Ensure consistent known root password on all OSD webUI+workers size:S added

Updated by okurz 6 months ago

Copied to action #162359: Change OSD root to more modern filesystem mount options size:S added

Updated by okurz 6 months ago

Assignee set to nicksinger
Priority changed from Normal to High
Target version changed from Tools - Next to Ready

I assume @nicksinger is working on that.

Updated by okurz 6 months ago

Due date set to 2024-07-01
Status changed from New to Feedback

This is how we tested, from OSD itself, how the system behaves when e.g. the additional disks are completely absent:

qemu-system-x86_64 -m 8192 -snapshot -hda /dev/vda -nographic -serial mon:stdio -smp 4

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1207 is the corresponding MR by nicksinger, merged.

Updated by nicksinger 6 months ago

Status changed from Feedback to In Progress

I haven't run a full test matrix with different disks missing, but I can at least confirm that my changes bring up the machine if nothing besides the root disk is present. The biggest remaining problem seems to be auditd.service, which tries to start 30 times and always fails at this step:

Jun 17 19:53:37 openqa systemd[1]: Starting Security Auditing Service...
Jun 17 19:53:38 openqa auditd[622]: Could not open dir /var/log/audit (No such file or directory)
Jun 17 19:53:38 openqa auditd[622]: The audit daemon is exiting.
Jun 17 19:53:38 openqa systemd[1]: auditd.service: Control process exited, code=exited, status=6/NOTCONFIGURED
Jun 17 19:53:39 openqa auditctl[623]: enabled 0
Jun 17 19:53:39 openqa auditctl[623]: failure 1
Jun 17 19:53:39 openqa auditctl[623]: pid 0
Jun 17 19:53:39 openqa auditctl[623]: rate_limit 0
Jun 17 19:53:39 openqa auditctl[623]: backlog_limit 64
Jun 17 19:53:39 openqa auditctl[623]: lost 0
Jun 17 19:53:39 openqa auditctl[623]: backlog 0
Jun 17 19:53:39 openqa auditctl[623]: backlog_wait_time 15000
Jun 17 19:53:39 openqa auditctl[623]: backlog_wait_time_actual 0
Jun 17 19:53:39 openqa auditctl[623]: No rules
Jun 17 19:53:39 openqa systemd[1]: auditd.service: Failed with result 'exit-code'.
Jun 17 19:53:39 openqa systemd[1]: Failed to start Security Auditing Service.
Jun 17 19:53:39 openqa systemd[1]: auditd.service: Scheduled restart job, restart counter is at 6.
Jun 17 19:53:39 openqa systemd[1]: Stopped Security Auditing Service.

This seems to happen because our /var/log is a symlink pointing into /srv/log/. I have to think about a solution, or maybe we just keep it as is, because eventually auditd gives up and the system boot continues until login (which is what we actually want to achieve).

Updated by jbaier_cz 6 months ago

nicksinger wrote in #note-5:

which seems to happen because our /var/log is a symlink pointing into /srv/log/. I have to think about a solution or maybe we just keep it because eventually auditd aborts and the system boot continues until login (what we actually want to achieve).

Maybe adding a new systemd dependency on /srv/log being mounted for auditd would help?
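
A minimal sketch of such a dependency, assuming a systemd drop-in is used. RequiresMountsFor= is a standard [Unit] option that adds Requires= and After= dependencies on the mount units needed for the given path. The helper name write_mount_dropin and the file name requires-mounts.conf are made up for illustration and are not taken from the actual salt states.

```shell
#!/bin/sh
# Write a systemd drop-in that ties a unit to a mount point.
# Usage: write_mount_dropin UNIT MOUNT_PATH [BASEDIR]
# BASEDIR defaults to /etc/systemd/system; pass a scratch directory to
# try this out without root privileges.
write_mount_dropin() {
    unit=$1
    mount_path=$2
    base=${3:-/etc/systemd/system}
    dir="$base/$unit.d"
    mkdir -p "$dir" || return 1
    # The unit is only started once the mounts for mount_path are active.
    printf '[Unit]\nRequiresMountsFor=%s\n' "$mount_path" \
        > "$dir/requires-mounts.conf"
}

# Example against a scratch directory instead of the live system:
scratch=$(mktemp -d)
write_mount_dropin auditd.service /srv/log "$scratch"
cat "$scratch/auditd.service.d/requires-mounts.conf"
```

On a real host the drop-in would go under /etc/systemd/system and need a systemctl daemon-reload afterwards. The same pattern would apply to the openQA services and their data directories.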

Updated by nicksinger 6 months ago

Status changed from In Progress to Workable
Assignee deleted (~~nicksinger~~)
Priority changed from High to Normal

I've created an MR to adjust auditd https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1212
I think we now have the main parts in place to lower the prio as we can now safely restart OSD if any disk fails.

However, we still need to adjust how systemd behaves if this happens. Currently the system starts up fine but only puts some services (nginx, postgres) into the failed state (systemctl --failed). This is confusing and not really helpful for an uninitiated person debugging this. systemd clearly realizes that something went wrong ("openqa-webui.service: Job openqa-webui.service/start failed with result 'dependency'."), but the service itself ends up as inactive (dead) rather than failed. I could not work out why this happens or how this behavior can be changed without blocking the complete boot chain again.
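
Because units skipped for a failed dependency end up inactive rather than failed, systemctl --failed alone does not list them. One possible workaround when debugging, sketched here under assumptions, is to query an explicit list of expected units. The helper name report_not_active and the SYSTEMCTL override variable are made up for illustration.

```shell
#!/bin/sh
# Report every listed unit that is not "active", including units left
# "inactive" after a dependency failure.
# SYSTEMCTL can be overridden, e.g. to point at a stub for testing.
SYSTEMCTL=${SYSTEMCTL:-systemctl}

report_not_active() {
    rc=0
    for unit in "$@"; do
        # "is-active" prints the unit state (active, inactive, failed, ...)
        # and exits non-zero for anything other than active.
        state=$("$SYSTEMCTL" is-active "$unit" 2>/dev/null)
        if [ "$state" != "active" ]; then
            printf '%s: %s\n' "$unit" "${state:-unknown}"
            rc=1
        fi
    done
    return $rc
}
```

It could be called as report_not_active openqa-webui.service nginx.service postgresql.service, where the last two unit names are guesses for the nginx and postgres services mentioned above. The function returns non-zero if any unit is not active, so it can also serve as a simple health check after a reboot.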

Updated by okurz 6 months ago

Status changed from Workable to New

Updated by livdywan 6 months ago

Subject changed from Treat OSD non-root mounts as non-critical for boot to Treat OSD non-root mounts as non-critical for boot size:M
Description updated (diff)
Status changed from New to Workable

#10

Updated by livdywan 6 months ago

Due date deleted (~~2024-07-01~~)
Start date deleted (~~2024-06-17~~)

#11

Updated by okurz 6 months ago

Status changed from Workable to Resolved
Assignee set to nicksinger

Both ACs are fulfilled as far as we can verify without artificially impacting OSD itself. I verified with virtual machines that the relevant services are not started. I think the current behaviour, where those services are not started and end up "inactive" or "dead", is the best we can get, and that should suffice. Resolving and assigning back to nicksinger, who did most of the relevant work.
