action #162362

closed

2024-06-15 osd not accessible - ensure healthy filesystems size:S

Added by okurz 13 days ago. Updated 5 days ago.

Status: Resolved
Priority: High
Assignee:
Category: Feature requests
Target version:
Start date: 2024-06-17
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

#162332-7 / Check the filesystems for errors from the running OS, without rebooting. If there are errors, shut down the openQA services gracefully and with prior announcement, fix the problems from the running OS, and trigger reboots only after ensuring the filesystems are clean.

Acceptance criteria

  • AC1: All 5 storage devices on OSD report clean filesystem integrity

Suggestions

  • Run for i in b c d e; do xfs_repair -m 4096 -n /dev/vd$i; done on OSD to check. If any problems are found, try to keep services running but in read-only mode as we did at some point in the past, or at least stop openqa-scheduler and similar services. Then run xfs_repair without -n on the affected storage devices during off-hours with prior announcement, e.g. on Thursday during the maintenance window.

Rollback steps

  • DONE re-enable cron service for OSD in openqa-service in /etc/crontab

Related issues 2 (1 open, 1 closed)

  • Copied from openQA Infrastructure - action #162359: Change OSD root to more modern filesystem mount options (New, 2024-06-17)
  • Copied to openQA Infrastructure - action #162365: OSD can fail on xfs_repair OOM conditions size:S (Resolved, jbaier_cz, 2024-06-17)

Actions #1

Updated by okurz 13 days ago

  • Copied from action #162359: Change OSD root to more modern filesystem mount options added
Actions #2

Updated by okurz 13 days ago

  • Copied to action #162365: OSD can fail on xfs_repair OOM conditions size:S added
Actions #3

Updated by livdywan 10 days ago

  • Subject changed from 2024-06-15 osd not accessible - ensure healthy filesystems to 2024-06-15 osd not accessible - ensure healthy filesystems size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by mkittler 9 days ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #5

Updated by mkittler 9 days ago

  • Status changed from In Progress to Feedback

Even the read-only check already requires the filesystems to be unmounted (or at least mounted read-only):

for i in b c d e; do echo "testing /dev/vd$i" ; sudo xfs_repair -m 4096 -n /dev/vd$i; done
testing /dev/vdb
xfs_repair: /dev/vdb contains a mounted and writable filesystem

fatal error -- couldn't initialize XFS library
testing /dev/vdc
xfs_repair: /dev/vdc contains a mounted and writable filesystem

fatal error -- couldn't initialize XFS library
testing /dev/vdd
xfs_repair: /dev/vdd contains a mounted and writable filesystem

fatal error -- couldn't initialize XFS library
testing /dev/vde
xfs_repair: /dev/vde contains a mounted and writable filesystem

fatal error -- couldn't initialize XFS library

I suppose to check vdb we need to stop all the services as it contains the database. (I suppose systemctl stop srv.mount will do that for us. Probably systemctl stop nginx would make sense as well.)
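
The failure above could be detected before xfs_repair is even called. The following is only a minimal sketch, not what was run on OSD: the function names is_mounted_rw and check_devices are made up for illustration, and the optional second argument to is_mounted_rw (an alternative mounts table) exists only to make the helper testable.

```shell
# Return 0 if the device is listed as mounted read-write in a mounts table.
# $1: device path, $2: mounts table (hypothetical override, defaults to /proc/mounts)
is_mounted_rw() {
    awk -v dev="$1" '
        $1 == dev {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == "rw") found = 1
        }
        END { exit !found }
    ' "${2:-/proc/mounts}"
}

# Run the read-only check (-n) on every given device that is not mounted read-write.
check_devices() {
    for dev in "$@"; do
        if is_mounted_rw "$dev"; then
            echo "skipping $dev: mounted read-write" >&2
            continue
        fi
        echo "testing $dev"
        xfs_repair -m 4096 -n "$dev" \
            || echo "$dev: xfs_repair -n reported problems or failed (exit $?)" >&2
    done
}

# Example call, same devices as in the loop above:
#   check_devices /dev/vdb /dev/vdc /dev/vdd /dev/vde
```

With all four devices mounted read-write, as in the output above, this would only print four "skipping" lines instead of four fatal errors.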

vdc, vdd, and vde contain assets, results and fixed assets and archived jobs respectively. So stopping the scheduler wouldn't be enough; we also have to stop all jobs to prevent downloading/uploading errors. Probably it is also the cleanest to just stop the services. If we're quick enough jobs might not be actually impacted and continue after retrying.

Now we just need to find a good time to do this. Maybe today in the late afternoon?

Actions #6

Updated by mkittler 9 days ago · Edited

  • Status changed from Feedback to In Progress

vdb is tricky as we also write the system journal there. I also had to stop salt. So I'll try the other ones first. (To get PostgreSQL running again I had to run sudo systemctl start var-lib-pgsql.mount first.)

EDIT: I couldn't even unmount /srv/space-slow because it was used by /results/share/factory/hdd/fixed but I could not unmount that (umount: /results/share/factory/hdd/fixed: target is busy.) despite lsof /results/share/factory/hdd/fixed not showing anything.
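
A possible way to narrow down such a "target is busy" situation is to list every mount point that is backed by the same device, since additional mounts of the same filesystem (for example bind mounts) keep it in use without showing up in lsof. This is a sketch under assumptions: mountpoints_of_device is a made-up helper name, the second argument is only a test override, and mount points containing spaces (escaped in /proc/mounts) are not handled. Kernel-side users such as NFS exports would not be found this way either.

```shell
# Print all mount points whose source is the given device.
# $1: device path, $2: mounts table (hypothetical override, defaults to /proc/mounts)
mountpoints_of_device() {
    awk -v dev="$1" '$1 == dev { print $2 }' "${2:-/proc/mounts}"
}

# Example: show the processes using each mount point of a device
# (fuser -m also reports users that lsof on a single directory can miss):
#   for mp in $(mountpoints_of_device /dev/vde); do fuser -vm "$mp"; done
```

All listed mount points have to be unmounted before the device itself is free for xfs_repair.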

Somehow I don't think we can do this with just a short interruption of the production services. We should probably boot the machine cleanly into some minimal target (not sure yet how to do that) when there is a good/non-busy window. Or we should simply let the filesystem check run on the next boot. In any case, we should probably have someone with hypervisor access ready¹.

¹ Note that I was already afraid I had lost access just now because systemctl stop srv.mount kicked me out of the SSH session. I find that very strange because the command actually did nothing. I could just log in again, and /srv was not unmounted because it was busy anyway. It also didn't cause a restart of sshd.service, which as of now has been running for over a day. I invoked systemctl stop srv.mount several times and it always kicked me out, but luckily without further consequences.

Actions #7

Updated by mkittler 9 days ago

  • Status changed from In Progress to Workable
Actions #8

Updated by okurz 9 days ago

Seemingly openqa-scheduler was not re-enabled, which triggered an alert. Started the scheduler again, should be good.

Actions #9

Updated by okurz 9 days ago

  • Due date set to 2024-07-05
  • Status changed from Workable to In Progress
  • Assignee changed from mkittler to okurz

Giving this another try. In a screen session I unmounted vd{c,d,e}, stopped the services and am now running in parallel:

xfs_repair -m 4096 /dev/vdc
xfs_repair -m 4096 /dev/vdd
xfs_repair -m 4096 /dev/vde

vdc finished quickly after about 5m.
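
For reference, the three parallel runs could also be started from one shell that waits for all of them and reports which device failed. This is only a sketch, not the procedure used here: repair_in_parallel is a made-up function name, and REPAIR_CMD and LOG_DIR are hypothetical environment overrides added so the function can be dry-run without touching any device.

```shell
# Start one repair per device in the background, then wait for each one.
# The devices must already be unmounted. Output goes to one log file per device.
# REPAIR_CMD (hypothetical): command to run, defaults to "xfs_repair -m 4096".
# LOG_DIR (hypothetical): directory for the log files, defaults to the current one.
repair_in_parallel() {
    log_dir=${LOG_DIR:-.}
    jobs_started=""
    for dev in "$@"; do
        # REPAIR_CMD is deliberately unquoted so that it splits into command and options
        ${REPAIR_CMD:-xfs_repair -m 4096} "$dev" \
            > "$log_dir/repair-$(basename "$dev").log" 2>&1 &
        jobs_started="$jobs_started $!:$dev"
    done
    rc=0
    for entry in $jobs_started; do
        pid=${entry%%:*}
        dev=${entry#*:}
        if wait "$pid"; then
            echo "$dev: ok"
        else
            echo "$dev: FAILED"
            rc=1
        fi
    done
    return $rc
}

# Example call:
#   repair_in_parallel /dev/vdc /dev/vdd /dev/vde
```

The results are reported in the order the devices were given, not in the order they finish, so a quick device listed after a slow one is only reported once the slow one is done.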

Actions #10

Updated by okurz 9 days ago

  • Description updated (diff)
Actions #11

Updated by okurz 9 days ago

  • Description updated (diff)

vdd finished after 1.5h. Remounted all and started services.

Called

for i in failed incomplete parallel_failed; do host=openqa.suse.de failed_since="2024-06-21 20:00" result="result='$i'" additional_filters="reason like '%502%'" comment="label:poo#162362" ./openqa-advanced-retrigger-jobs; done

I found that within the running OS systemd-fsck@dev-disk-by\x2duuid-b5377fcf\x2d6273\x2d4f38\x2da471\x2dcefff08c60b7.service was started and failed repeatedly while I was running an explicit foreground xfs_repair, which made the device busy. This might mean that the filesystem for vdd is still treated as needing repairs.

I guess we need to consider disabling the filesystem checks requested by /etc/fstab and plan to do them on our own explicitly.
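
As a starting point for that, the entries for which a check is requested can be listed from the sixth field of /etc/fstab (fs_passno): a value of 0 means no check at boot, any other value requests one. The helper below is a sketch with a made-up name, and the optional argument is only there so it can be tried on a copy of the file.

```shell
# Print mount point, filesystem type and fs_passno for every fstab entry
# that requests a filesystem check (sixth field present and not 0).
# $1: fstab file (hypothetical override, defaults to /etc/fstab)
fstab_fsck_entries() {
    awk '!/^[[:space:]]*#/ && NF >= 6 && $6 != 0 { print $2, $3, "passno=" $6 }' \
        "${1:-/etc/fstab}"
}

# Example:
#   fstab_fsck_entries
```

Setting the sixth field to 0 for an entry listed here would be the way to stop requesting the automatic check for it; the explicit xfs_repair runs would then have to be scheduled separately.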

Actions #12

Updated by jbaier_cz 9 days ago

okurz wrote in #note-11:

I guess we need to consider disabling the filesystem checks requested by /etc/fstab and plan to do them on our own explicitly.

Yes, that is actually the same conclusion I came to in #162365#note-11.

Actions #13

Updated by okurz 5 days ago

  • Due date deleted (2024-07-05)
  • Status changed from In Progress to Resolved

I found https://www.2daygeek.com/repairing-xfs-file-system-in-rhel/ explaining how to artificially corrupt filesystems for experimentation purposes, but I am not following up on that for now.

I tried xfs_scrub but that fails with

EXPERIMENTAL xfs_scrub program in use! Use at your own risk!
Error: /srv: Kernel metadata scrubbing facility is not available.
Info: /srv: Scrub aborted after phase 1.
/srv: operational errors found: 1

same for other partitions so this is of no help.

dmesg | grep XFS reports no problems

[    8.584115] SGI XFS with ACLs, security attributes, quota, no debug enabled
[    8.717693] XFS (vdb): Mounting V5 Filesystem
[    8.814670] XFS (vdb): Ending clean mount
[   76.995026] XFS (vdc): Mounting V5 Filesystem
[   77.064424] XFS (vdc): Ending clean mount
[  314.123230] XFS (vdd): Mounting V5 Filesystem
[  315.590921] XFS (vdd): Ending clean mount
[  316.842499] XFS (vde): Mounting V5 Filesystem
[  316.909011] XFS (vde): Ending clean mount
[184978.697206] XFS (vde): Unmounting Filesystem
[185036.712008] XFS (vdd): Unmounting Filesystem
[185094.638973] XFS (vdc): Unmounting Filesystem
[187443.039406] XFS (vdc): Mounting V5 Filesystem
[187443.162253] XFS (vdc): Ending clean mount
[187445.350033] XFS (vde): Mounting V5 Filesystem
[187445.403976] XFS (vde): Ending clean mount
[191624.557251] XFS (vdd): Mounting V5 Filesystem
[191625.868496] XFS (vdd): Ending clean mount

so I assume that means that we are safe.
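
The manual reading of the dmesg output could be turned into a small filter that prints only suspicious XFS lines. This is a heuristic sketch: xfs_log_problems is a made-up name and the keyword list (corrupt, error, shutdown, fail) is an assumption about what a problem report would contain, not an exhaustive list of XFS kernel messages. Note also that "Ending clean mount" only tells us that the mount went through cleanly, it is not a full metadata check like xfs_repair -n.

```shell
# Read kernel log lines on stdin and print the XFS lines that look like problems.
# Exit status is 0 if at least one suspicious line was found, non-zero otherwise.
xfs_log_problems() {
    grep 'XFS' | grep -Ei 'corrupt|error|shut ?down|fail'
}

# Example:
#   dmesg | xfs_log_problems || echo "no XFS problems logged"
```

On the log shown above the filter would print nothing, which matches the conclusion drawn from reading it by hand.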
