Broken storage week?

Added by Anonymous about 10 years ago

You might have seen the announcement on news.opensuse.org already: one of the main storage systems had some problems last week and we are still suffering from the effects (one virtual array is still missing).

But there was also another, smaller issue: the internal storage on one of the backend servers also reported problems. The machine is a bit older and uses an internal RAID controller to provide ten 1 TB disks to the running SLES system. As the RAID controller is also "very old" (~6-7 years), each of the 1 TB disks is exported as a "single RAID", because the controller simply cannot handle a RAID larger than 1 TB. On top of that, a software RAID 6 runs across all ten disks.

Now our monitoring notified us that the RAID was degraded: one of the ten disks had died (a naughty little beggar who claims "btrfs is the reason" ;-). So far so good. But guess how frustrated an admin can be if he tries to identify the broken disk and there is absolutely NO LIGHT or other indicator on the disk cages? So he guessed the "right" disk and, heya, chose the wrong one. Happily, with RAID 6 you can lose two hard disks without a problem. So we re-inserted the disk and waited for the RAID to finish the rebuild... But sadly the RAID controller then started to break: right after we re-inserted the disk, the controller lost nearly all disks, resulting in an array with a lot of "spares". A reboot solved the problem, for ~10 minutes...
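For readers who want to spot this kind of degradation themselves, here is a minimal sketch of how the state of a software RAID could be read from the text of /proc/mdstat. It is not the monitoring check we actually run. The sample text, the device names (sda1 ... sdj1) and the function names are made up for illustration; only the general /proc/mdstat layout (member list with "(F)" and "(S)" flags, and the "[10/9] [UU_UUUUUUU]" status) is taken from the Linux md driver.

```python
import re

# Hypothetical sample in the usual /proc/mdstat layout: a 10-disk RAID 6
# with one member marked as failed (F). Device names are made up.
SAMPLE_MDSTAT = """\
Personalities : [raid6] [raid5] [raid4]
md2 : active raid6 sdj1[9] sdi1[8] sdh1[7] sdg1[6] sdf1[5] sde1[4] sdd1[3] sdc1[2](F) sdb1[1] sda1[0]
      7813523456 blocks level 6, 64k chunk, algorithm 2 [10/9] [UU_UUUUUUU]

unused devices: <none>
"""

HEAD_RE = re.compile(r"^(md\S+) : (\S+) (.*)$")
MEMBER_RE = re.compile(r"([\w-]+)\[\d+\](?:\((\w)\))?")
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")


def parse_mdstat(text):
    """Return one dict per md array found in the given mdstat text."""
    arrays = []
    current = None
    for line in text.splitlines():
        head = HEAD_RE.match(line)
        if head:
            name, state, rest = head.groups()
            members = MEMBER_RE.findall(rest)
            current = {
                "name": name,
                "state": state,
                "failed": [dev for dev, flag in members if flag == "F"],
                "spares": [dev for dev, flag in members if flag == "S"],
                "expected": None,  # number of devices the array should have
                "active": None,    # number of devices currently in sync
            }
            arrays.append(current)
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            current["expected"] = int(status.group(1))
            current["active"] = int(status.group(2))
    return arrays


def is_degraded(array):
    """An array is degraded when fewer devices are active than expected."""
    return (array["expected"] is not None
            and array["active"] < array["expected"])


if __name__ == "__main__":
    for arr in parse_mdstat(SAMPLE_MDSTAT):
        print(arr["name"], "degraded:", is_degraded(arr), "failed:", arr["failed"])
```

On a real machine you would pass in the contents of /proc/mdstat instead of the sample. Note that this only tells you the kernel device name of the failed member; it does not tell you which physical slot it sits in, which was exactly our problem with the unlit disk cages.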

So after 60 minutes of fighting against old hardware, we decided to go with another solution: using an old, dedicated FC storage. Luckily the old server came back successfully after we inserted the extra FC card, and the RAID controller even allowed us to mount the degraded RAID in read-only mode to copy over the last bits and bytes.

After 3 hours of "mdadm /dev/md2 --add /dev/sdx1; mdadm --stop /dev/md2; mdadm --assemble --scan --force; mdadm ..." loops, we can report that the backend for the staging projects is back without any data loss...
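As a rough illustration of one round of that loop, the sketch below wraps the three commands named above in a small helper. It is an assumption about how such a round could be scripted, not what we ran: the function name and the dry_run switch are hypothetical, and whatever came after the trailing "mdadm ..." is deliberately left out. By default it only prints the commands, so nothing touches a real array.

```python
import subprocess


def recovery_round(array, member, dry_run=True):
    """Run (or just print) one add / stop / force-assemble round.

    array  -- md device, e.g. "/dev/md2"
    member -- partition to add back, e.g. "/dev/sdx1"
    """
    commands = [
        ["mdadm", array, "--add", member],             # add the member to the array
        ["mdadm", "--stop", array],                    # stop the array
        ["mdadm", "--assemble", "--scan", "--force"],  # force re-assembly
    ]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=False: keep going even if one step fails, as in a manual loop
            subprocess.run(cmd, check=False)
    return commands


if __name__ == "__main__":
    recovery_round("/dev/md2", "/dev/sdx1")
```

Forcing an assembly can make things worse on an array that is already in a bad state, so only switch off dry_run when you have a copy of the data, as we did via the read-only mount.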

