tickets #154057: Degraded RAID arrays on falkor21 and squanchy

Added by crameleon 5 months ago. Updated about 1 month ago.

Status: In Progress
Priority: Normal
Assignee: crameleon
Category: Physical infrastructure / Hardware
Target version: -
Start date: 2024-01-22
Due date: -
% Done: 80%
Estimated time: -

Description

The mdadm monitor reported degraded arrays on both hosts:

falkor21:

falkor21 (Hypervisor):~ # mdadm -D /dev/md127
/dev/md127:
           Version : 1.0
     Creation Time : Mon Oct  9 21:44:30 2023
        Raid Level : raid1
        Array Size : 233832256 (223.00 GiB 239.44 GB)
     Used Dev Size : 233832256 (223.00 GiB 239.44 GB)
      Raid Devices : 2
     Total Devices : 1
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Mon Jan 22 22:12:48 2024
             State : clean, degraded 
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : any:falkor21arr0
              UUID : 6de056e4:33e6bbd1:e48dc0e0:188adfc1
            Events : 122005

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8       17        1      active sync   /dev/sdb1

squanchy:

squanchy (Hypervisor):~ # mdadm -D /dev/md126
/dev/md126:
           Version : 1.2
     Creation Time : Fri Oct 13 15:03:35 2023
        Raid Level : raid1
        Array Size : 937560384 (894.13 GiB 960.06 GB)
     Used Dev Size : 937560384 (894.13 GiB 960.06 GB)
      Raid Devices : 2
     Total Devices : 1
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Mon Jan 22 22:09:08 2024
             State : clean, degraded 
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : squanchy_arr1
              UUID : cb9cab59:ea55a8fd:51d0df9b:e0facb17
            Events : 150403

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8       48        1      active sync   /dev/sdd

Both disks seem to be there:

falkor21:

Disk /dev/sda: 223.57 GiB, 240057409536 bytes, 468862128 sectors
Disk model: SAMSUNG MZ7LH240
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 13EC682C-9D26-4AD0-B336-9E7B4BAEAA2B

Device     Start       End   Sectors  Size Type
/dev/sda1   2048 467666943 467664896  223G Linux RAID


Disk /dev/sdb: 223.57 GiB, 240057409536 bytes, 468862128 sectors
Disk model: SAMSUNG MZ7LH240
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 851701FE-1F05-4691-B39E-231C1D98EF25

Device     Start       End   Sectors  Size Type
/dev/sdb1   2048 467666943 467664896  223G Linux RAID

squanchy:

Disk /dev/sdc: 894.25 GiB, 960197124096 bytes, 1875385008 sectors
Disk model: SAMSUNG MZ7L3960
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: C0E24D1F-8F71-4E15-A961-430DEB131448


Disk /dev/sdd: 894.25 GiB, 960197124096 bytes, 1875385008 sectors
Disk model: SAMSUNG MZ7L3960
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
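
To confirm the removed members are still usable beyond fdisk, the RAID superblocks and the kernel's view of the arrays are worth checking too (a sketch; this output was not captured here):

# kernel's current view of all md arrays
cat /proc/mdstat
# inspect the md superblock on the dropped member on falkor21
mdadm --examine /dev/sda1
# same on squanchy, where the member is the whole disk
mdadm --examine /dev/sdc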

Adding some to-dos for this as checklist items to the ticket.

On falkor21, the affected array hosts the root/system partitions. On squanchy, the affected array "only" hosts the data disk for downloadtmp.i.o.o (aka download-prg.i.o.o).
In either case we should remediate this soon.


Checklist

  • Recover array
  • Assess why disk dropped from the array
  • Verify disks are healthy
Actions #1

Updated by crameleon 5 months ago

  • Description updated (diff)
Actions #2

Updated by crameleon 5 months ago

  • Private changed from Yes to No
Actions #3

Updated by crameleon 5 months ago

  • Status changed from New to In Progress
  • Assignee set to crameleon

I couldn't determine why the disk was removed, but while trying to assess its health I noticed we are missing smartd to monitor general disk health. I will deploy it soon.
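
For reference, the sort of checks this involves (a sketch, device names are examples):

# look for md/ata errors around the time the member dropped (current boot only; use -b -1 etc. for earlier boots)
journalctl -k | grep -iE 'md/raid|ata|I/O error'
# quick SMART health summary and the usual wear/error attributes
smartctl -H /dev/sda
smartctl -a /dev/sda | grep -iE 'reallocated|pending|uncorrect'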

On squanchy, for now I "just" re-added the disk to the array; it is rebuilding, seemingly without any issues:

squanchy (Hypervisor):~ # mdadm /dev/md126 --add /dev/sdc
mdadm: added /dev/sdc
squanchy (Hypervisor):~ # mdadm -D /dev/md126
/dev/md126:
...
    Rebuild Status : 7% complete
...
    Number   Major   Minor   RaidDevice State
       2       8       32        0      spare rebuilding   /dev/sdc
       1       8       48        1      active sync   /dev/sdd

If this goes well I will do the same on falkor21.
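
Rebuild progress can be followed without re-running mdadm -D, for example:

# live view of the resync/rebuild progress and ETA
watch -n 30 cat /proc/mdstat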

Actions #4

Updated by crameleon 5 months ago

Seems happy:

squanchy (Hypervisor):~ # mdadm -D /dev/md126
/dev/md126:
           Version : 1.2
     Creation Time : Fri Oct 13 15:03:35 2023
        Raid Level : raid1
        Array Size : 937560384 (894.13 GiB 960.06 GB)
     Used Dev Size : 937560384 (894.13 GiB 960.06 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Wed Jan 24 16:25:56 2024
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : squanchy_arr1
              UUID : cb9cab59:ea55a8fd:51d0df9b:e0facb17
            Events : 155414

    Number   Major   Minor   RaidDevice State
       2       8       32        0      active sync   /dev/sdc
       1       8       48        1      active sync   /dev/sdd

So I will do the same on falkor21.

I also started working on a Salt formula for smartd: https://github.com/openSUSE/salt-formulas/pull/108.
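
Once the formula is merged, rolling it out should be a matter of something like the following (the state name is an assumption until the PR lands):

# apply the smartd state to both affected hypervisors
salt 'squanchy*' state.apply smartd
salt 'falkor21*' state.apply smartd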

Actions #5

Updated by crameleon 5 months ago

  • Checklist item Recover array set to Done

On falkor21, it did not let me re-add the device the same way; it reported the resource as busy, even though findmnt/lsof did not show anything holding it open. After a reboot, the disk was magically part of the array again:

falkor21 (Hypervisor):~ # mdadm -D /dev/md127
/dev/md127:
           Version : 1.0
     Creation Time : Mon Oct  9 21:44:30 2023
        Raid Level : raid1
        Array Size : 233832256 (223.00 GiB 239.44 GB)
     Used Dev Size : 233832256 (223.00 GiB 239.44 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Wed Jan 24 18:20:23 2024
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : any:falkor21arr0
              UUID : 6de056e4:33e6bbd1:e48dc0e0:188adfc1
            Events : 132369

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
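
I did not capture what exactly held the device busy; for future reference, a few checks that might reveal the holder in a similar situation (a sketch, not what was run at the time):

# is the member already claimed by another (possibly stale) md array?
cat /proc/mdstat
# any block devices stacked on top of it?
lsblk /dev/sda
# device-mapper targets or processes with the node open?
dmsetup ls
fuser -v /dev/sda1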
Actions #7

Updated by crameleon about 1 month ago

  • Checklist item Verify disks are healthy set to Done
  • % Done changed from 50 to 80

SMART monitoring and alerting deployed.
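
A quick way to double-check the deployment on each host (sketch):

# smartd should be enabled and running
systemctl is-enabled smartd && systemctl is-active smartd
# and it should see all the disks
smartctl --scan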
