action #162362
closed
2024-06-15 osd not accessible - ensure healthy filesystems size:S
Added by okurz 6 months ago.
Updated 6 months ago.
Category:
Feature requests
Description
Motivation
#162332-7 / Check the filesystems from the running OS, without rebooting, to see whether there are errors. If there are, then gracefully, and with prior announcement, shut down the openQA services, fix the problems from the running OS, and only trigger reboots after ensuring the filesystems are clean
Acceptance criteria
- AC1: All 5 storage devices on OSD report clean filesystem integrity
Suggestions
- Run
for i in b c d e; do xfs_repair -m 4096 -n /dev/vd$i; done
on OSD to check (see the sketch below). If any problems are found, try to keep services running but in read-only mode as we did some time in the past, or at least stop openqa-scheduler and similar services, and run xfs_repair without -n on the affected storage devices during off-times with a prior announcement, e.g. on Thursday during the maintenance window
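A minimal sketch of how such a read-only check could be wrapped into a small script that collects a per-device verdict (the device list and memory limit come from the suggestion above; everything else is an assumption):

# read-only check of all data devices; xfs_repair -n never modifies anything,
# but it still refuses to run on filesystems that are mounted read-write
rc=0
for i in b c d e; do
    echo "== checking /dev/vd$i =="
    if sudo xfs_repair -m 4096 -n "/dev/vd$i"; then
        echo "/dev/vd$i: no problems reported"
    else
        echo "/dev/vd$i: problems found or check aborted"
        rc=1
    fi
done
exit $rc   # non-zero if any device needs a real repair run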
Rollback steps
- DONE re-enable cron service for OSD in openqa-service in /etc/crontab
- Copied from action #162359: Change OSD root to more modern filesystem mount options size:S added
- Copied to action #162365: OSD can fail on xfs_repair OOM conditions size:S added
- Subject changed from 2024-06-15 osd not accessible - ensure healthy filesystems to 2024-06-15 osd not accessible - ensure healthy filesystems size:S
- Description updated (diff)
- Status changed from New to Workable
- Status changed from Workable to In Progress
- Assignee set to mkittler
- Status changed from In Progress to Feedback
Even the read-only check already requires the filesystems to be unmounted (or at least mounted read-only):
for i in b c d e; do echo "testing /dev/vd$i" ; sudo xfs_repair -m 4096 -n /dev/vd$i; done
testing /dev/vdb
xfs_repair: /dev/vdb contains a mounted and writable filesystem
fatal error -- couldn't initialize XFS library
testing /dev/vdc
xfs_repair: /dev/vdc contains a mounted and writable filesystem
fatal error -- couldn't initialize XFS library
testing /dev/vdd
xfs_repair: /dev/vdd contains a mounted and writable filesystem
fatal error -- couldn't initialize XFS library
testing /dev/vde
xfs_repair: /dev/vde contains a mounted and writable filesystem
fatal error -- couldn't initialize XFS library
I suppose to check vdb we need to stop all the services as it contains the database. (I suppose systemctl stop srv.mount will do that for us. Probably systemctl stop nginx would make sense as well.)
vdc, vdd, and vde contain assets, results, and fixed assets plus archived jobs, respectively. So stopping the scheduler wouldn't be enough; we would also have to stop all jobs to prevent download/upload errors. It is probably also cleanest to just stop the services. If we're quick enough, jobs might not actually be impacted and will continue after retrying.
Now we just need to find a good time to do this. Maybe today in the late afternoon?
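Something like the following could work as the shutdown/check sequence; the exact unit names on OSD are assumptions and would need to be verified:

# stop the user-facing and job-related services first
sudo systemctl stop openqa-scheduler openqa-webui openqa-websockets openqa-gru nginx
# stop the database and anything else writing to the affected devices
sudo systemctl stop postgresql salt-minion
# unmount the filesystems via their mount units
sudo systemctl stop var-lib-pgsql.mount srv.mount
# read-only check of the database device
sudo xfs_repair -m 4096 -n /dev/vdb
# bring everything back up in reverse order
sudo systemctl start srv.mount var-lib-pgsql.mount postgresql
sudo systemctl start salt-minion nginx openqa-gru openqa-websockets openqa-webui openqa-scheduler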
- Status changed from Feedback to In Progress
vdb is tricky as we also write the system journal there. I also had to stop salt. So I'll try the other ones first. (To get PostgreSQL running again I had to run sudo systemctl start var-lib-pgsql.mount first.)
EDIT: I couldn't even unmount /srv/space-slow because it was used by /results/share/factory/hdd/fixed, and I could not unmount that either (umount: /results/share/factory/hdd/fixed: target is busy.) despite lsof /results/share/factory/hdd/fixed not showing anything.
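When lsof shows nothing, the mount point can still be held by a nested or bind mount, or by an NFS export of the path, rather than by a process. A few commands that could narrow this down (paths copied from above):

findmnt -R /results/share/factory/hdd/fixed      # list any submounts below the mount point
sudo fuser -vm /results/share/factory/hdd/fixed  # list processes using the mounted filesystem itself
sudo umount -l /results/share/factory/hdd/fixed  # last resort: lazy unmount, detaches the mount point immediately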
Somehow I don't think we can do this with just a short interruption of the production services. We should probably boot the machine cleanly into some minimal target (not sure yet how to do that) when there is a good/non-busy window. Or we should simply let the filesystem check run on the next reboot. In any case, we should probably have someone with hypervisor access be ready¹.
¹ Note that I was already afraid I had lost access just now because systemctl stop srv.mount kicked me out of the SSH session. I find that very strange because the command actually did nothing: I could just log in again, and /srv was not unmounted because it was busy anyway. It also didn't cause a restart of sshd.service, which as of now has still been running for over a day. I invoked systemctl stop srv.mount several times and it always kicked me out, but luckily without further consequences.
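For the "minimal target" idea, one possible approach, assuming someone is on the hypervisor console because this stops sshd as well:

sudo systemctl isolate rescue.target      # single-user-like mode, stops most services including sshd
# ...run the xfs_repair invocations here...
sudo systemctl isolate multi-user.target  # return to normal operation
# alternatively boot once with systemd.unit=rescue.target on the kernel command line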
- Status changed from In Progress to Workable
Seemingly we forgot to re-enable openqa-scheduler again, which triggered an alert. Started the scheduler again; it should be good now.
- Due date set to 2024-07-05
- Status changed from Workable to In Progress
- Assignee changed from mkittler to okurz
Giving this another try. In a screen session I unmounted vd{c,d,e}, stopped the services, and am running in parallel:
xfs_repair -m 4096 /dev/vdc
xfs_repair -m 4096 /dev/vdd
xfs_repair -m 4096 /dev/vde
vdc finished quickly, after about 5 minutes.
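For reference, the whole sequence condensed into a sketch (running the repairs in the background and the log file locations are assumptions):

for i in c d e; do sudo umount "/dev/vd$i"; done
for i in c d e; do sudo xfs_repair -m 4096 "/dev/vd$i" > "/tmp/xfs_repair_vd$i.log" 2>&1 & done
wait            # block until all three repairs have finished
sudo mount -a   # remount everything listed in /etc/fstab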
- Description updated (diff)
- Description updated (diff)
vdd finished after about 1.5 hours. Remounted everything and started the services again.
Called
for i in failed incomplete parallel_failed; do host=openqa.suse.de failed_since="2024-06-21 20:00" result="result='$i'" additional_filters="reason like '%502%'" comment="label:poo#162362" ./openqa-advanced-retrigger-jobs; done
I found that within the running OS systemd-fsck@dev-disk-by\x2duuid-b5377fcf\x2d6273\x2d4f38\x2da471\x2dcefff08c60b7.service was started and failed recurringly while I was running an explicit foreground xfs_repair, which made the device busy. This might mean that the filesystem on vdd is still treated as needing repairs.
I guess we need to consider disabling the filesystem checks requested by /etc/fstab and plan to do them on our own explicitly.
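If we go that route: boot-time checks are controlled by the sixth fstab field (fs_passno), and setting it to 0 keeps systemd-fstab-generator from creating the corresponding systemd-fsck@ unit. A placeholder entry as a sketch (mount point and options are assumptions, the UUID is the one from the failing unit above):

# /etc/fstab — last field 0 means never fsck at boot, a non-zero value schedules a check
UUID=b5377fcf-6273-4f38-a471-cefff08c60b7  /results  xfs  defaults  0 0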
okurz wrote in #note-11:
I guess we need to consider disabling the filesystem checks requested by /etc/fstab and plan to do them on our own explicitly.
Yes, that is actually the same conclusion I arrived at in #162365#note-11.
- Due date deleted (2024-07-05)
- Status changed from In Progress to Resolved
I found https://www.2daygeek.com/repairing-xfs-file-system-in-rhel/ explaining how to artificially corrupt filesystems for experimentation purposes, but I am not following up on that for now.
I tried xfs_scrub, but that fails with:
EXPERIMENTAL xfs_scrub program in use! Use at your own risk!
Error: /srv: Kernel metadata scrubbing facility is not available.
Info: /srv: Scrub aborted after phase 1.
/srv: operational errors found: 1
The same happens for the other partitions, so this is of no help.
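The "Kernel metadata scrubbing facility is not available" message suggests the running kernel lacks online scrub support; one way to verify this, assuming the kernel config is installed under /boot as usual on openSUSE:

grep XFS_ONLINE_SCRUB "/boot/config-$(uname -r)"
# xfs_scrub needs CONFIG_XFS_ONLINE_SCRUB=y in the running kernel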
dmesg | grep XFS
reports no problems:
[ 8.584115] SGI XFS with ACLs, security attributes, quota, no debug enabled
[ 8.717693] XFS (vdb): Mounting V5 Filesystem
[ 8.814670] XFS (vdb): Ending clean mount
[ 76.995026] XFS (vdc): Mounting V5 Filesystem
[ 77.064424] XFS (vdc): Ending clean mount
[ 314.123230] XFS (vdd): Mounting V5 Filesystem
[ 315.590921] XFS (vdd): Ending clean mount
[ 316.842499] XFS (vde): Mounting V5 Filesystem
[ 316.909011] XFS (vde): Ending clean mount
[184978.697206] XFS (vde): Unmounting Filesystem
[185036.712008] XFS (vdd): Unmounting Filesystem
[185094.638973] XFS (vdc): Unmounting Filesystem
[187443.039406] XFS (vdc): Mounting V5 Filesystem
[187443.162253] XFS (vdc): Ending clean mount
[187445.350033] XFS (vde): Mounting V5 Filesystem
[187445.403976] XFS (vde): Ending clean mount
[191624.557251] XFS (vdd): Mounting V5 Filesystem
[191625.868496] XFS (vdd): Ending clean mount
so I assume that means we are safe.
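For future reference, a slightly stricter variant that only surfaces actual problems rather than all XFS lines could look like this (the pattern is an assumption about typical XFS error messages):

dmesg | grep -iE 'xfs.*(error|corrupt|shutdown)'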