[Beowulf] Surviving a double disk failure
Stuart Midgley
sdm900 at gmail.com
Thu Apr 9 19:02:31 PDT 2009
I thought I would share an unpleasant experience - surviving a double
disk failure with raid 5.
We have a >1000 core cluster with a 130TB Lustre setup and, obviously, a
lot of spindles. Our Lustre is a "cheap" scalable setup: 30 OSSes with
software raid 5.
So what happens when you get a double disk failure in an OSS? Well, the
md device drops, Lustre on the OSS obviously can't write to disk, and
clients start getting errors. In Australia we say "it's gone
balls up".
How to recover without losing (much) data? Well, first let me say I
generally hate hardware raid solutions. Have a double disk failure and
you're stuffed. Rebuild the LUN, rebuild your fs and say sorry to your
clients/customers. It isn't fun, trust me, been there, done that. Not
fun at all.
I love software raid. First, you get a real CPU and lots of memory
behind your raid controller, and a real OS which allows you to recover
from a double disk failure.
Ok. So what did we do? First we noted which sectors gave the errors
and shut down. Then remove one of the failed disks and put in a new
one. Boot up and your md device won't start - you have 1 failed disk
and 1 new disk.
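Before touching anything it's worth checking what md actually thinks of
each member. Something like this (device names are made up, substitute
your own):

    # array state as the kernel sees it
    cat /proc/mdstat

    # per-member superblocks: event counts and which slots md thinks
    # are active, failed or spare
    mdadm --examine /dev/sd[b-g]1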
Ok. This is where Linux is great.
Find the bad sectors and check that they are faulty with dd, by reading
a few MB around the failed sectors. Make sure you know the smallest
block of corruption. Now dd over the top of the corruption, causing the
disk to reallocate those sectors.
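Roughly what that looks like - the device name and sector numbers below
are placeholders, take the real ones from your own kernel log (the
usual end_request I/O errors report absolute 512-byte sectors on the
device), and triple-check you have the right disk before the write:

    # read a few MB either side of the reported bad sector to find the
    # smallest run that actually errors out
    dd if=/dev/sdc of=/dev/null bs=512 skip=123454700 count=4096

    # then write zeros over just that run so the drive reallocates it;
    # this deliberately destroys those blocks
    dd if=/dev/zero of=/dev/sdc bs=512 seek=123456700 count=256 conv=notrunc,fsync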
Ok, now force-assemble the raid device and rebuild/resync onto the new
disk. Sure, you will not have a coherent fs any more, but you will have
most of your data.
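With mdadm that's roughly the following (array and member names are
examples; the member list is whatever --examine showed you as usable):

    # force assembly with the surviving members plus the dodgy-but-now-
    # readable disk; the array comes up degraded
    mdadm --assemble --force /dev/md0 /dev/sd[b-f]1

    # add the blank new disk and let md rebuild onto it
    mdadm /dev/md0 --add /dev/sdg1

    # watch the rebuild
    cat /proc/mdstat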
Once the raid rebuilds, your new disk should be a byte-for-byte replica
of the first disk you removed, except for the area you dd'd over. Now
shut down, remove the failed disk, put the first failed disk back in
and reboot. Stop the raid device (which should have started in
degraded mode) and dd the appropriate sectors from the failed disk to
the new disk, replacing the corrupt-rebuilt area (hopefully this works,
and you didn't suffer the double failure at exactly the same place on
both disks). Then shut down again and put another new disk in. Reboot
and rebuild/resync your raid device onto the 2nd new disk.
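The copy back is just dd with matching skip and seek offsets. This
assumes the md data starts at the same place on both members (true for
the old 0.90 superblock, which lives at the end of the disk; newer
metadata formats have a data offset) and that the disks are partitioned
identically. Names and numbers are again placeholders:

    # copy only the corrupt-rebuilt region from the original failed disk
    # back onto the rebuilt new disk, same offsets on both
    dd if=/dev/sdc of=/dev/sdf bs=512 skip=123456700 seek=123456700 \
       count=256 conv=notrunc,fsync

    # once the 2nd new disk is physically in (partitioned to match the
    # others), add it and resync again
    mdadm /dev/md0 --add /dev/sdc1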
Now, of course, an fsck and a Lustre fsck and you are back in action,
possibly with an MB or two of bad data.
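On an ldiskfs-backed OST that first step is basically an ext3-style
fsck of the md device while Lustre is stopped; the distributed
Lustre-level check (lfsck) that follows depends on your Lustre version,
so check the manual rather than my memory for the exact invocation:

    # check and repair the backing filesystem on the reassembled array
    e2fsck -fy /dev/md0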
File system recovered. Restart Lustre and all your clients go through
recovery and hopefully overwrite any bad data in your file system.
Jobs which failed get rerun and, again, hopefully overwrite the small
area that may have been corrupt. Hopefully all is ok.
What are the lessons learnt? Well, with software raid, Linux is both
your friend and your enemy. The behaviour of md got us into this mess.
When md gets an error on read, it recovers the data from the other
disks and re-writes the blocks to the failing disk, hoping the disk
will reallocate them. You do get a warning saying that md encountered a
recoverable error, so you think it is ok. BUT the disk still failed on
read and you haven't swapped it out. Some time later, when another disk
fails hard and you get a failed read on your other dodgy disk, md sees
2 failed disks. And it's all over.
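You can catch this before it bites. The raid code logs the corrected
reads (exact wording varies with kernel version) and SMART shows the
drive quietly remapping sectors; something like (device name is an
example):

    # corrected-read warnings from the md/raid layer
    dmesg | grep -i 'read error corrected'

    # drive-level evidence: remapped and pending sectors
    smartctl -A /dev/sdc | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector'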
My advice: don't let Linux collude with the disk vendors and reduce
your reliability. Swap any disk that gets a correctable error on
read. Reallocation on write is fine, not on read. The disk has failed.
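In practice that means kicking the suspect disk out yourself while the
array is still only one disk down, something like (member name is an
example):

    # mark the suspect member failed and pull it from the array
    mdadm /dev/md0 --fail /dev/sdc1
    mdadm /dev/md0 --remove /dev/sdc1

    # swap the physical disk, partition it to match the others,
    # then add it back and let md resync
    mdadm /dev/md0 --add /dev/sdc1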
Tim, you're a genius. Thanks, mate. Once I land back in the country,
cold beers all round.
--
Stu Midgley
sdm900 at gmail.com