I awoke this morning to find an email from my RAID host (Linux software RAID) telling me that a drive had failed. It's consumer hardware, so not a big deal; I have cold spares. However, when I got to the server, the whole thing was unresponsive. At some point I figured I had no choice but to cut the power and restart.
The system came up, the failed drive is still marked as failed, and /proc/mdstat looks correct. However, /dev/md0 won't mount; mount tells me:
mount: /dev/md0: can't read superblock
Now I'm starting to worry. So I try xfs_check and xfs_repair, the former of which tells me:
xfs_check: /dev/md0 is invalid (cannot read first 512 bytes)
and the latter:
Phase 1 - find and verify superblock...
superblock read failed, offset 0, size 524288, ag 0, rval 0
fatal error -- Invalid argument
Now I'm getting scared. So far my Googling has been to no avail. I'm not in full panic mode just yet, because I've been scared before and it's always worked out within a few days. I can still pop in my cold spare tonight, let it rebuild (for 36 hours), and then see if the file system is in a more usable state. I could maybe even try to reshape the array back down to 10 drives from the current 11 (since I haven't grown the file system yet) and see if that helps (which takes the better part of a week).
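For reference, the reshape I have in mind would be along these lines (a sketch only -- the backup-file path is a placeholder, and I believe shrinking the member count may also require reducing the array size first, which I'd want to double-check before running anything):

```shell
# Shrink the array from 11 members back down to 10.
# mdadm requires a backup file on a separate disk while the
# reshape runs; /root/md0-reshape.bak is just an example path.
mdadm --grow /dev/md0 --raid-devices=10 --backup-file=/root/md0-reshape.bak
```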
But while I'm at work, before I can do any of this at home tonight, I'd like to seek the help of experts here.
Does anybody more knowledgeable about file systems and RAID have any recommendations? Maybe there's something I can do over SSH from here to further diagnose the file system problem, or even perchance repair it?
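In case it helps, these are the read-only checks I can run from here over SSH right now (device names as in my setup; none of these should write to the array):

```shell
# Recent kernel messages -- look for I/O errors or md complaints
dmesg | tail -n 50

# The array's state as mdadm sees it
mdadm --detail /dev/md0

# Per-member superblocks: event counts and array state on each disk
mdadm --examine /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 \
    /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1

# The kernel's view of all md arrays
cat /proc/mdstat
```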
Edit:
Looks like /proc/mdstat is actually offering a clue:
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]
md0 : inactive sdk1[10] sdh1[7] sdj1[5] sdg1[8] sdi1[6] sdc1[2] sdd1[3] sde1[4] sdf1[9] sdb1[0]
19535119360 blocks
inactive? So I try to assemble the array:
# mdadm --assemble /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1
mdadm: device /dev/md0 already active - cannot assemble it
It's already active? Even though /proc/mdstat is telling me that it's inactive?
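One idea I'm tempted to try tonight, but would like a sanity check on first: stop the half-assembled array so the device is released, then force-assemble it from the ten surviving members. My understanding (from the mdadm man page) is that --force tells mdadm to accept members whose event counters are slightly out of sync:

```shell
# Release /dev/md0 so it can be reassembled
mdadm --stop /dev/md0

# Reassemble from the surviving members
mdadm --assemble --force /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 \
    /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1
```

Is that safe to do with a degraded array, or does it risk making things worse?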