[Beowulf] Surviving a double disk failure

Sat Apr 11 19:38:16 PDT 2009

Stuart Midgley wrote:
> Thanks to all the responses, it has been interesting reading.  We have 
> started using raid6 on newer servers and will slowly get rid of our old 
> raid5 servers.
> 
> I found the comments about scrubbing very interesting.  What do people 
> do with their file systems?  We couldn't afford the reduced performance 

We scrub our software RAIDs (our DeltaV) once a week, and our hardware 
RAIDs once a week as well.  Left alone, errors can accumulate. 
Scrubbing isn't perfect, and as Michael and others have pointed out, 
there can be bugs.  But honestly, I am of the opinion that several 
hours of reduced performance during a scrub are a heck of a lot better 
than dealing with down time due to an "event".

Scrubbing occurs in the background, and you can limit its impact.
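
As a rough illustration of limiting the impact, here is a minimal 
sketch that assumes a Linux md software RAID; hardware controllers have 
their own vendor tools, which are not shown.  The function names, the 
array name "md0" and the rates are made up for the example.  The sysfs 
and proc files are the standard md ones, and writing to them needs root.

```python
from pathlib import Path


def start_scrub(array, min_kib_s, max_kib_s, sysfs_root="/sys", proc_root="/proc"):
    """Throttle md sync traffic, then request a background consistency check.

    Assumes Linux md. 'array' is a name such as "md0" (hypothetical here).
    """
    raid_ctl = Path(proc_root, "sys", "dev", "raid")
    # Per-device rate (KiB/s) md aims for while other I/O is hitting the array.
    (raid_ctl / "speed_limit_min").write_text(f"{min_kib_s}\n")
    # Per-device ceiling (KiB/s) used when the array is otherwise idle.
    (raid_ctl / "speed_limit_max").write_text(f"{max_kib_s}\n")
    # "check" reads every stripe and counts inconsistencies without rewriting;
    # "repair" would also rewrite them, and "idle" stops a running pass.
    Path(sysfs_root, "block", array, "md", "sync_action").write_text("check\n")


def mismatch_count(array, sysfs_root="/sys"):
    """Return the mismatch count md reported for the most recent check."""
    path = Path(sysfs_root, "block", array, "md", "mismatch_cnt")
    return int(path.read_text().strip())
```

Run from a weekly cron job, something like start_scrub("md0", 1000, 
50000) would kick off the pass, and mismatch_count("md0") can be read 
once it finishes.  The numbers are placeholders; pick rates that fit 
your own I/O load.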

> and time for scrubbing.  We run our Lustre setup almost flat out all the 
> time.  We regularly do over a PB of io in a week (we often have our 
> total throughput at ~3GB/s for weeks on end).  We use lustre as our 
> scratch space so backups are not possible.  Nothing could get the data 
> off fast enough between us creating/using/deleting it.
> 
> Of course, the fact that we basically run at 95% full all the time is as 
> good as scrubbing :)

Not quite ...  Scrubbing is a more structured test and repair pass. 
Normal I/O may leave coverage holes ... even at 95% capacity.

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615