action #162362: 2024-06-15 osd not accessible - ensure healthy filesystems size:S - openQA Infrastructure (public) - openSUSE Project Management Tool

action #162362

closed

2024-06-15 osd not accessible - ensure healthy filesystems size:S

Added by okurz 11 months ago. Updated 10 months ago.

Status: Resolved
Priority: High
Assignee: okurz
Category: Feature requests
Target version: openQA Project (public) - Ready
Start date: 2024-06-17
Due date:
% Done:
Estimated time:
Tags: infra

Description

Motivation

See #162332-7. Check the filesystems from the running OS, without rebooting, to see whether there are errors. If there are, announce and gracefully shut down the openQA services, fix the problems from the running OS, and trigger reboots only after confirming the filesystems are clean.

Acceptance criteria

AC1: All 5 storage devices on OSD report clean filesystem integrity

Suggestions

To check, run the following on OSD: for i in b c d e; do xfs_repair -m 4096 -n /dev/vd$i; done

If problems are found, try to keep services running in read-only mode, as we did at some point in the past, or at least stop openqa-scheduler and related services. Then run xfs_repair without -n on the affected storage devices during off-hours and with prior announcement, e.g. on Thursday during the maintenance window.
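The check loop from the suggestion can be wrapped so that the result for each device is recorded. This is only a minimal sketch: it assumes that xfs_repair -n exits non-zero when it detects corruption or cannot run at all, as its man page describes for no-modify mode. The function name check_devices and the XFS_REPAIR override variable are made up for illustration and are not part of any existing tooling.

```bash
#!/bin/bash
# Hypothetical helper: run a no-modify xfs_repair check on each given device
# and print one summary line per device. With the real xfs_repair this needs
# root and filesystems that are unmounted or mounted read-only.
# XFS_REPAIR is an illustrative override so the loop can be tried with a stub.
check_devices() {
    local repair_cmd=${XFS_REPAIR:-xfs_repair}
    local dev failed=0
    for dev in "$@"; do
        # -n: no-modify mode, -m 4096: cap memory use at roughly 4096 MB
        if "$repair_cmd" -m 4096 -n "$dev" >/dev/null 2>&1; then
            echo "$dev: clean"
        else
            echo "$dev: problems found or check could not run"
            failed=1
        fi
    done
    # Non-zero if at least one device did not check out cleanly
    return "$failed"
}

# Example (as root): check_devices /dev/vdb /dev/vdc /dev/vdd /dev/vde
```

The function returns non-zero as soon as one device fails, so it could gate a follow-up step such as an announcement or a repair run.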

Rollback steps

DONE re-enable cron service for OSD in openqa-service in /etc/crontab

Updated by okurz 11 months ago

Copied from action #162359: Change OSD root to more modern filesystem mount options size:S added

Updated by okurz 11 months ago

Copied to action #162365: OSD can fail on xfs_repair OOM conditions size:S added

Updated by livdywan 11 months ago

Subject changed from 2024-06-15 osd not accessible - ensure healthy filesystems to 2024-06-15 osd not accessible - ensure healthy filesystems size:S
Description updated (diff)
Status changed from New to Workable

Updated by mkittler 11 months ago

Status changed from Workable to In Progress
Assignee set to mkittler

Updated by mkittler 11 months ago

Status changed from In Progress to Feedback

Even the check-only run already requires the filesystems to be unmounted (or at least mounted read-only):

for i in b c d e; do echo "testing /dev/vd$i" ; sudo xfs_repair -m 4096 -n /dev/vd$i; done
testing /dev/vdb
xfs_repair: /dev/vdb contains a mounted and writable filesystem

fatal error -- couldn't initialize XFS library
testing /dev/vdc
xfs_repair: /dev/vdc contains a mounted and writable filesystem

fatal error -- couldn't initialize XFS library
testing /dev/vdd
xfs_repair: /dev/vdd contains a mounted and writable filesystem

fatal error -- couldn't initialize XFS library
testing /dev/vde
xfs_repair: /dev/vde contains a mounted and writable filesystem

fatal error -- couldn't initialize XFS library

I suppose that to check vdb we need to stop all the services, as it contains the database. (I suppose systemctl stop srv.mount will do that for us. systemctl stop nginx would probably make sense as well.)

vdc, vdd, and vde contain assets, results, and fixed assets plus archived jobs, respectively. So stopping the scheduler wouldn't be enough; we also have to stop all jobs to prevent download/upload errors. It is probably cleanest to just stop the services. If we're quick enough, jobs might not actually be impacted and could continue after retrying.

Now we just need to find a good time to do this. Maybe today in the late afternoon?
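The refusal shown above can be anticipated by looking up a device's mount state before calling xfs_repair. The following is a sketch under the assumption that the device appears in /proc/mounts under the same name that is passed in (e.g. /dev/vdb rather than a by-uuid path); mount_state is a hypothetical name chosen for this example.

```bash
#!/bin/bash
# Hypothetical helper: print "unmounted", "ro" or "rw" for a block device,
# based on a file in /proc/mounts format (device mountpoint fstype options ...).
# The optional second argument allows testing against a sample file.
mount_state() {
    local dev=$1 mounts=${2:-/proc/mounts}
    awk -v dev="$dev" '
        $1 == dev {
            found = 1
            ro = 0
            # Mount options are comma-separated in the fourth field
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == "ro") ro = 1
            # A device can be mounted more than once; any writable mount counts
            if (!ro) any_rw = 1
        }
        END {
            if (!found) print "unmounted"
            else if (any_rw) print "rw"
            else print "ro"
        }
    ' "$mounts"
}

# Example: only run the check when the device is not mounted writable
# [ "$(mount_state /dev/vdc)" != rw ] && xfs_repair -m 4096 -n /dev/vdc
```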

Updated by mkittler 11 months ago · Edited

Status changed from Feedback to In Progress

vdb is tricky as we also write the system journal there. I also had to stop salt. So I'll try the other ones first. (To get PostgreSQL running again I had to run sudo systemctl start var-lib-pgsql.mount first.)

EDIT: I couldn't even unmount /srv/space-slow because it was in use by /results/share/factory/hdd/fixed, and I could not unmount that either (umount: /results/share/factory/hdd/fixed: target is busy.), even though lsof /results/share/factory/hdd/fixed showed nothing.

I don't think we can do this with just a short interruption of the production services. We should probably boot the machine cleanly into some minimal target (not sure yet how to do that) when there is a good/non-busy window. Or we could simply let the filesystem check run on the next reboot. In any case, we should probably have someone with hypervisor access on standby¹.

¹ Note that I was already afraid I had lost access just now because systemctl stop srv.mount kicked me out of the SSH session. I find that very strange because the command actually did nothing. I could just log in again, and /srv had not been unmounted because it was busy anyway. It also didn't cause a restart of sshd.service, which as of now has still been running for over a day. I invoked systemctl stop srv.mount several times and it kicked me out every time, but luckily without further consequences.

Updated by mkittler 11 months ago

Status changed from In Progress to Workable

Updated by okurz 11 months ago

It seems openqa-scheduler was not re-enabled afterwards, which triggered an alert. I started the scheduler again, so this should be good now.

Updated by okurz 11 months ago

Due date set to 2024-07-05
Status changed from Workable to In Progress
Assignee changed from mkittler to okurz

Giving this another try. In a screen session I unmounted vd{c,d,e}, stopped services, and am now running the following in parallel:

xfs_repair -m 4096 /dev/vdc
xfs_repair -m 4096 /dev/vdd
xfs_repair -m 4096 /dev/vde

vdc finished quickly, after about 5 minutes.
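Instead of separate terminals, the parallel runs could also be started as background jobs from one shell and collected with wait. The following is only a sketch, not what was actually run here: repair_in_parallel, XFS_REPAIR and LOG_DIR are hypothetical names. Note that each process gets its own -m limit, so the total memory use can approach the sum of all limits.

```bash
#!/bin/bash
# Hypothetical helper: start one xfs_repair per device in the background,
# write each output to its own log file, then wait for all of them and
# report the exit status per device. Needs root and unmounted filesystems
# when used with the real xfs_repair.
repair_in_parallel() {
    local repair_cmd=${XFS_REPAIR:-xfs_repair}
    local log_dir=${LOG_DIR:-.}
    local dev pid entry pids="" failed=0
    for dev in "$@"; do
        "$repair_cmd" -m 4096 "$dev" >"$log_dir/repair-$(basename "$dev").log" 2>&1 &
        # Remember which PID belongs to which device
        pids="$pids $!:$dev"
    done
    for entry in $pids; do
        pid=${entry%%:*}
        dev=${entry#*:}
        if wait "$pid"; then
            echo "$dev: xfs_repair exited 0"
        else
            echo "$dev: xfs_repair exited non-zero"
            failed=1
        fi
    done
    return "$failed"
}

# Example (as root, inside screen): repair_in_parallel /dev/vdc /dev/vdd /dev/vde
```

Results are reported in the order the devices were given, not in the order they finish, so a slow first device delays the summary lines of the faster ones.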

#10

Updated by okurz 11 months ago

Description updated (diff)

#11

Updated by okurz 11 months ago

Description updated (diff)

vdd finished after 1.5h. Remounted all and started services.

Called

for i in failed incomplete parallel_failed; do host=openqa.suse.de failed_since="2024-06-21 20:00" result="result='$i'" additional_filters="reason like '%502%'" comment="label:poo#162362" ./openqa-advanced-retrigger-jobs; done

I found that within the running OS, systemd-fsck@dev-disk-by\x2duuid-b5377fcf\x2d6273\x2d4f38\x2da471\x2dcefff08c60b7.service was started repeatedly and failed each time, because my explicit foreground xfs_repair kept the device busy. This might mean that the filesystem on vdd is still treated as needing repairs.

I guess we need to consider disabling the filesystem checks requested by /etc/fstab and plan to do them on our own explicitly.
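One way to stop boot-time checks from being requested is to set the sixth /etc/fstab field (fs_passno) to 0 for the affected entries. This sketch assumes the recurring systemd-fsck@ unit stems from a non-zero fs_passno, which has not been verified for OSD. The function name disable_fsck_pass is made up; it only prints a modified copy and never edits /etc/fstab in place.

```bash
#!/bin/bash
# Hypothetical helper: read fstab content on stdin and print it with the
# sixth field (fs_passno) set to 0 for all entries of the given filesystem
# type. Comments, blank lines and entries without a sixth field pass through
# unchanged. Changed lines are re-joined with single spaces by awk.
disable_fsck_pass() {
    local fstype=${1:-xfs}
    awk -v fstype="$fstype" '
        /^[[:space:]]*#/ || NF < 6 { print; next }
        $3 == fstype { $6 = 0 }
        { print }
    '
}

# Example: review the result before replacing anything
# disable_fsck_pass xfs < /etc/fstab > /tmp/fstab.new && diff -u /etc/fstab /tmp/fstab.new
```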

#12

Updated by jbaier_cz 11 months ago

okurz wrote in #note-11:

I guess we need to consider disabling the filesystem checks requested by /etc/fstab and plan to do them on our own explicitly.

Yes, that is the same conclusion I came to in #162365#note-11.

#13

Updated by okurz 10 months ago

Due date deleted (~~2024-07-05~~)
Status changed from In Progress to Resolved

I found https://www.2daygeek.com/repairing-xfs-file-system-in-rhel/, which explains how to artificially corrupt filesystems for experimentation purposes, but I am not pursuing that for now.

I tried xfs_scrub but that fails with

EXPERIMENTAL xfs_scrub program in use! Use at your own risk!
Error: /srv: Kernel metadata scrubbing facility is not available.
Info: /srv: Scrub aborted after phase 1.
/srv: operational errors found: 1

The same happens for the other partitions, so this is of no help.

dmesg | grep XFS reports no problems:

[    8.584115] SGI XFS with ACLs, security attributes, quota, no debug enabled
[    8.717693] XFS (vdb): Mounting V5 Filesystem
[    8.814670] XFS (vdb): Ending clean mount
[   76.995026] XFS (vdc): Mounting V5 Filesystem
[   77.064424] XFS (vdc): Ending clean mount
[  314.123230] XFS (vdd): Mounting V5 Filesystem
[  315.590921] XFS (vdd): Ending clean mount
[  316.842499] XFS (vde): Mounting V5 Filesystem
[  316.909011] XFS (vde): Ending clean mount
[184978.697206] XFS (vde): Unmounting Filesystem
[185036.712008] XFS (vdd): Unmounting Filesystem
[185094.638973] XFS (vdc): Unmounting Filesystem
[187443.039406] XFS (vdc): Mounting V5 Filesystem
[187443.162253] XFS (vdc): Ending clean mount
[187445.350033] XFS (vde): Mounting V5 Filesystem
[187445.403976] XFS (vde): Ending clean mount
[191624.557251] XFS (vdd): Mounting V5 Filesystem
[191625.868496] XFS (vdd): Ending clean mount

so I assume that means that we are safe.
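The manual reading of the log above can be turned into a small check: every "Mounting" line for a device should be followed by an "Ending clean mount" line for the same device. This is a sketch that relies on the message wording shown above. The function name unclean_xfs_mounts is hypothetical, and the check covers only this one signal, not every kind of XFS error message.

```bash
#!/bin/bash
# Hypothetical helper: read kernel log text on stdin and print every XFS
# device whose most recent "Mounting" message was not followed by
# "Ending clean mount". Exits 1 if any such device is found, else 0.
unclean_xfs_mounts() {
    awk '
        function devname(line) {
            # Extract "vdb" from "... XFS (vdb): ..."
            sub(/.*XFS \(/, "", line)
            sub(/\).*/, "", line)
            return line
        }
        /XFS \([^)]*\): Mounting/           { pending[devname($0)] = 1 }
        /XFS \([^)]*\): Ending clean mount/ { delete pending[devname($0)] }
        END {
            bad = 0
            for (d in pending) { print d; bad = 1 }
            exit bad
        }
    '
}

# Example: dmesg | unclean_xfs_mounts && echo "all XFS mounts ended cleanly"
```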
