tickets #154057: Degraded RAID arrays on falkor21 and squanchy

Added by crameleon 5 months ago. Updated about 1 month ago.

Status: In Progress
Priority: Normal
Assignee: crameleon
Category: Physical infrastructure / Hardware
Target version: -
Start date: 2024-01-22
Due date: -
% Done: 80%
Estimated time: -

Description

The mdadm monitor reported degraded arrays on both hosts:

falkor21:

falkor21 (Hypervisor):~ # mdadm -D /dev/md127
/dev/md127:
           Version : 1.0
     Creation Time : Mon Oct  9 21:44:30 2023
        Raid Level : raid1
        Array Size : 233832256 (223.00 GiB 239.44 GB)
     Used Dev Size : 233832256 (223.00 GiB 239.44 GB)
      Raid Devices : 2
     Total Devices : 1
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Mon Jan 22 22:12:48 2024
             State : clean, degraded 
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : any:falkor21arr0
              UUID : 6de056e4:33e6bbd1:e48dc0e0:188adfc1
            Events : 122005

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8       17        1      active sync   /dev/sdb1

squanchy:

squanchy (Hypervisor):~ # mdadm -D /dev/md126
/dev/md126:
           Version : 1.2
     Creation Time : Fri Oct 13 15:03:35 2023
        Raid Level : raid1
        Array Size : 937560384 (894.13 GiB 960.06 GB)
     Used Dev Size : 937560384 (894.13 GiB 960.06 GB)
      Raid Devices : 2
     Total Devices : 1
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Mon Jan 22 22:09:08 2024
             State : clean, degraded 
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : squanchy_arr1
              UUID : cb9cab59:ea55a8fd:51d0df9b:e0facb17
            Events : 150403

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8       48        1      active sync   /dev/sdd

Both disks seem to be there:

falkor21:

Disk /dev/sda: 223.57 GiB, 240057409536 bytes, 468862128 sectors
Disk model: SAMSUNG MZ7LH240
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 13EC682C-9D26-4AD0-B336-9E7B4BAEAA2B

Device     Start       End   Sectors  Size Type
/dev/sda1   2048 467666943 467664896  223G Linux RAID


Disk /dev/sdb: 223.57 GiB, 240057409536 bytes, 468862128 sectors
Disk model: SAMSUNG MZ7LH240
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 851701FE-1F05-4691-B39E-231C1D98EF25

Device     Start       End   Sectors  Size Type
/dev/sdb1   2048 467666943 467664896  223G Linux RAID

squanchy:

Disk /dev/sdc: 894.25 GiB, 960197124096 bytes, 1875385008 sectors
Disk model: SAMSUNG MZ7L3960
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: C0E24D1F-8F71-4E15-A961-430DEB131448


Disk /dev/sdd: 894.25 GiB, 960197124096 bytes, 1875385008 sectors
Disk model: SAMSUNG MZ7L3960
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
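
To confirm the removed members are still usable beyond fdisk, the RAID superblocks and the kernel's view of the arrays are worth checking too (a sketch; this output was not captured here):

# kernel's current view of all md arrays
cat /proc/mdstat
# inspect the md superblock on the dropped member on falkor21
mdadm --examine /dev/sda1
# same on squanchy, where the member is the whole disk
mdadm --examine /dev/sdc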

Adding some to-dos for this as checklist items to the ticket.

On falkor21, the affected array hosts the root/system partitions. On squanchy, the affected array "only" hosts the data disk for downloadtmp.i.o.o (aka download-prg.i.o.o).
In either case we should remediate this soon.


Checklist

  • Recover array
  • Assess why disk dropped from the array
  • Verify disks are healthy
Actions #1

Updated by crameleon 5 months ago

  • Description updated (diff)
Actions #2

Updated by crameleon 5 months ago

  • Private changed from Yes to No
Actions #3

Updated by crameleon 5 months ago

  • Status changed from New to In Progress
  • Assignee set to crameleon

I couldn't determine why the disk was removed, but while trying to assess its health I noticed we are missing smartd to monitor general disk health. I will deploy it soon.
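
For reference, the sort of checks this involves (a sketch, device names are examples):

# look for md/ata errors around the time the member dropped (current boot only; use -b -1 etc. for earlier boots)
journalctl -k | grep -iE 'md/raid|ata|I/O error'
# quick SMART health summary and the usual wear/error attributes
smartctl -H /dev/sda
smartctl -a /dev/sda | grep -iE 'reallocated|pending|uncorrect'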

On squanchy, for now I "just" re-added the disk to the array; it is rebuilding, seemingly without any issues:

squanchy (Hypervisor):~ # mdadm /dev/md126 --add /dev/sdc
mdadm: added /dev/sdc
squanchy (Hypervisor):~ # mdadm -D /dev/md126
/dev/md126:
...
    Rebuild Status : 7% complete
...
    Number   Major   Minor   RaidDevice State
       2       8       32        0      spare rebuilding   /dev/sdc
       1       8       48        1      active sync   /dev/sdd

If this goes well I will do the same on falkor21.
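
Rebuild progress can be followed without re-running mdadm -D, for example:

# live view of the resync/rebuild progress and ETA
watch -n 30 cat /proc/mdstat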

Actions #4

Updated by crameleon 5 months ago

Seems happy:

squanchy (Hypervisor):~ # mdadm -D /dev/md126
/dev/md126:
           Version : 1.2
     Creation Time : Fri Oct 13 15:03:35 2023
        Raid Level : raid1
        Array Size : 937560384 (894.13 GiB 960.06 GB)
     Used Dev Size : 937560384 (894.13 GiB 960.06 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Wed Jan 24 16:25:56 2024
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : squanchy_arr1
              UUID : cb9cab59:ea55a8fd:51d0df9b:e0facb17
            Events : 155414

    Number   Major   Minor   RaidDevice State
       2       8       32        0      active sync   /dev/sdc
       1       8       48        1      active sync   /dev/sdd

So I will do the same on falkor21.

I also started working on a Salt formula for smartd: https://github.com/openSUSE/salt-formulas/pull/108.
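
Once the formula is merged, rolling it out should be a matter of something like the following (the state name is an assumption until the PR lands):

# apply the smartd state to both affected hypervisors
salt 'squanchy*' state.apply smartd
salt 'falkor21*' state.apply smartd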

Actions #5

Updated by crameleon 5 months ago

  • Checklist item Recover array set to Done

On falkor21, it did not let me re-add the device the same way; it reported the resource as busy, even though findmnt/lsof did not show anything holding it open. After a reboot, the disk was magically part of the array again:

falkor21 (Hypervisor):~ # mdadm -D /dev/md127
/dev/md127:
           Version : 1.0
     Creation Time : Mon Oct  9 21:44:30 2023
        Raid Level : raid1
        Array Size : 233832256 (223.00 GiB 239.44 GB)
     Used Dev Size : 233832256 (223.00 GiB 239.44 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Wed Jan 24 18:20:23 2024
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : any:falkor21arr0
              UUID : 6de056e4:33e6bbd1:e48dc0e0:188adfc1
            Events : 132369

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
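
I did not capture what exactly held the device busy; for future reference, a few checks that might reveal the holder in a similar situation (a sketch, not what was run at the time):

# is the member already claimed by another (possibly stale) md array?
cat /proc/mdstat
# any block devices stacked on top of it?
lsblk /dev/sda
# device-mapper targets or processes with the node open?
dmsetup ls
fuser -v /dev/sda1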
Actions #7

Updated by crameleon about 1 month ago

  • Checklist item Verify disks are healthy set to Done
  • % Done changed from 50 to 80

SMART monitoring and alerting deployed.
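
A quick way to double-check the deployment on each host (sketch):

# smartd should be enabled and running
systemctl is-enabled smartd && systemctl is-active smartd
# and it should see all the disks
smartctl --scan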
