[Beowulf] Surviving a double disk failure
Stuart Midgley
sdm900 at gmail.com
Thu Apr 9 19:02:31 PDT 2009
I thought I would share an unpleasant experience - surviving a double
disk failure with raid 5.
We have a >1000 core cluster with a 130TB Lustre setup and, obviously, a
lot of spindles. Our Lustre is a "cheap" scalable setup: 30 OSSes with
software raid 5.
So what happens when you get a double disk failure in an OSS? Well, the
md device drops, Lustre on the OSS obviously can't write to disk, and
clients start getting errors. In Australia we say "it's gone
balls up".
How to recover without losing (much) data? Well, first let me say I
generally hate hardware raid solutions. Have a double disk failure and
you're stuffed. Rebuild the LUN, rebuild your fs and say sorry to your
clients/customers. It isn't fun, trust me, been there, done that. Not
fun at all.
I love software raid. First, you get a real CPU and lots of memory
behind your raid controller, and a real OS which allows you to recover
from a double disk failure.
Ok. So what did we do? First we noted which sectors gave the errors
and shut down. Then remove one of the failed disks and put in a new
one. Boot up and your md device won't start - you have 1 failed disk
and 1 new disk.
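Before touching anything it's worth checking what md actually thinks of
each member. Something like this (device names are made up, substitute
your own):

    # array state as the kernel sees it
    cat /proc/mdstat

    # per-member superblocks: event counts and which slots md thinks
    # are active, failed or spare
    mdadm --examine /dev/sd[b-g]1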
Ok. This is where Linux is great.
Find the bad sectors and check that they are faulty with dd, by reading
a few MB around the failed sectors. Make sure you know the smallest
block of corruption. Now dd over the top of the corruption, causing the
disk to reallocate those sectors.
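Roughly what that looks like - the device name and sector numbers below
are placeholders, take the real ones from your own kernel log (the
usual end_request I/O errors report absolute 512-byte sectors on the
device), and triple-check you have the right disk before the write:

    # read a few MB either side of the reported bad sector to find the
    # smallest run that actually errors out
    dd if=/dev/sdc of=/dev/null bs=512 skip=123454700 count=4096

    # then write zeros over just that run so the drive reallocates it;
    # this deliberately destroys those blocks
    dd if=/dev/zero of=/dev/sdc bs=512 seek=123456700 count=256 conv=notrunc,fsync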
Ok, now force-assemble the raid device and rebuild/resync onto the new
disk. Sure, you will not have a coherent fs any more, but you will have
most of your data.
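With mdadm that's roughly the following (array and member names are
examples; the member list is whatever --examine showed you as usable):

    # force assembly with the surviving members plus the dodgy-but-now-
    # readable disk; the array comes up degraded
    mdadm --assemble --force /dev/md0 /dev/sd[b-f]1

    # add the blank new disk and let md rebuild onto it
    mdadm /dev/md0 --add /dev/sdg1

    # watch the rebuild
    cat /proc/mdstat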
Once the raid rebuilds, your new disk should be a byte-for-byte replica
of the first disk you removed, except for the area you dd'd over. Now
shut down, remove the failed disk, put the first failed disk back in
and reboot. Stop the raid device (which should have started in
degraded mode) and dd the appropriate sectors from the failed disk to
the new disk, replacing the corrupt-rebuilt area (hopefully this works,
and you didn't suffer the double failure at exactly the same place on
both disks). Then shut down again and put another new disk in. Reboot
and rebuild/resync your raid device onto the 2nd new disk.
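The copy back is just dd with matching skip and seek offsets. This
assumes the md data starts at the same place on both members (true for
the old 0.90 superblock, which lives at the end of the disk; newer
metadata formats have a data offset) and that the disks are partitioned
identically. Names and numbers are again placeholders:

    # copy only the corrupt-rebuilt region from the original failed disk
    # back onto the rebuilt new disk, same offsets on both
    dd if=/dev/sdc of=/dev/sdf bs=512 skip=123456700 seek=123456700 \
       count=256 conv=notrunc,fsync

    # once the 2nd new disk is physically in (partitioned to match the
    # others), add it and resync again
    mdadm /dev/md0 --add /dev/sdc1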
Now, of course, an fsck and a Lustre fsck and you are back in action,
possibly with an MB or two of bad data.
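On an ldiskfs-backed OST that first step is basically an ext3-style
fsck of the md device while Lustre is stopped; the distributed
Lustre-level check (lfsck) that follows depends on your Lustre version,
so check the manual rather than my memory for the exact invocation:

    # check and repair the backing filesystem on the reassembled array
    e2fsck -fy /dev/md0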
File system recovered. Restart Lustre and all your clients go through
recovery and hopefully overwrite any bad data in your file system.
Jobs which failed get rerun and, again, hopefully overwrite the small
area that may have been corrupt. Hopefully all is ok.
What are the lessons learnt? Well, with software raid, Linux is both
your friend and your enemy. The behaviour of md got us into this mess.
When md gets an error on read, it recovers the data from the other
disks and re-writes the blocks to the failing disk, hoping the disk
will reallocate them. You do get a warning saying that md encountered a
recoverable error, so you think it is ok. BUT the disk still failed on
read and you haven't swapped it out. Some time later, when another disk
fails hard and you get a failed read on your other dodgy disk, md sees
2 failed disks. And it's all over.
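You can catch this before it bites. The raid code logs the corrected
reads (exact wording varies with kernel version) and SMART shows the
drive quietly remapping sectors; something like (device name is an
example):

    # corrected-read warnings from the md/raid layer
    dmesg | grep -i 'read error corrected'

    # drive-level evidence: remapped and pending sectors
    smartctl -A /dev/sdc | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector'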
My advice: don't let Linux collude with the disk vendors and reduce
your reliability. Swap any disk that gets a correctable error on
read. Reallocation on write is fine, not on read. The disk has failed.
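In practice that means kicking the suspect disk out yourself while the
array is still only one disk down, something like (member name is an
example):

    # mark the suspect member failed and pull it from the array
    mdadm /dev/md0 --fail /dev/sdc1
    mdadm /dev/md0 --remove /dev/sdc1

    # swap the physical disk, partition it to match the others,
    # then add it back and let md resync
    mdadm /dev/md0 --add /dev/sdc1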
Tim, you're a genius. Thanks, mate. Once I land back in the country,
cold beers all round.
--
Stu Midgley
sdm900 at gmail.com