[Beowulf] Surviving a double disk failure
landman at scalableinformatics.com
Sat Apr 11 19:38:16 PDT 2009
Stuart Midgley wrote:
> Thanks to all the responses, it has been interesting reading. We have
> started using raid6 on newer servers and will slowly get rid of our old
> raid5 servers.
> I found the comments about scrubbing very interesting. What do people
> do with their file systems? We couldn't afford the reduced performance
Software RAIDs (our DeltaV) are scrubbed once a week. Hardware RAIDs
are also scrubbed once a week. Basically, errors can accumulate.
Scrubbing isn't perfect, and as Michael and others have pointed out,
there can be bugs. But honestly, I am of the opinion that several hours
of scrubbing, with the reduced performance that comes with it, are a heck
of a lot better than dealing with downtime due to an "event".
Scrubbing occurs in the background, and you can limit its impact.
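
As a minimal sketch of what "limit its impact" can look like, assuming
Linux md software RAID (an assumption, the setup is not spelled out
here): the kernel exposes per-device resync/check bandwidth limits in
/proc/sys/dev/raid/speed_limit_min and speed_limit_max (KiB/s), and a
scrub is requested by writing "check" to the array's md/sync_action
file in sysfs. The function name, the array name "md0", the default
limits, and the root parameter (only there so it can be tried against a
scratch directory) are all hypothetical.

```python
from pathlib import Path


def start_throttled_check(md_name="md0", max_kbps=20000, min_kbps=1000, root="/"):
    """Cap md resync/check bandwidth, then request a background scrub.

    Assumes Linux md software RAID. Limits are KiB/s per device.
    Needs root privileges when run against the real / tree.
    """
    root = Path(root)
    raid = root / "proc/sys/dev/raid"

    # Throttle first, so the scrub never starts at full speed.
    (raid / "speed_limit_min").write_text(f"{min_kbps}\n")
    (raid / "speed_limit_max").write_text(f"{max_kbps}\n")

    # "check" reads every stripe and counts mismatches without rewriting.
    sync_action = root / "sys/block" / md_name / "md/sync_action"
    sync_action.write_text("check\n")
    return sync_action
```

Calling start_throttled_check("md0") from a weekly cron job would give
the once-a-week cadence; writing "idle" to the same sync_action file
stops a running check if the array is needed at full speed.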
> and time for scrubbing. We run our Lustre setup almost flat out all the
> time. We regularly do over a PB of I/O in a week (we often have our
> total throughput at ~3GB/s for weeks on end). We use Lustre as our
> scratch space so backups are not possible. Nothing could get the data
> off fast enough between us creating/using/deleting it.
> Of course, the fact that we basically run at 95% full all the time is as
> good as scrubbing :)
Not quite ... Scrubbing is more of a structured test-and-repair pass.
Normal I/O may leave coverage holes ... even at 95% capacity.
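
A rough back-of-the-envelope sketch of one such coverage hole, under
stated assumptions: on a healthy parity RAID, normal reads fetch only
the data chunks of a stripe and skip parity, and free space is never
read at all. Even in the best case where every stored data block is
read back at least once, the parity and free blocks go unverified. The
function name and the 10-disk RAID6 layout below are hypothetical, and
file system fullness is treated as a stand-in for raw block usage.

```python
def unread_fraction(n_disks, parity_disks, fill_fraction):
    """Best-case fraction of raw blocks that normal I/O never reads.

    Assumes every stored data block is read at least once, parity is
    never read on a healthy array, and free space is never read.
    """
    if not 0 < parity_disks < n_disks:
        raise ValueError("need 0 < parity_disks < n_disks")
    if not 0.0 <= fill_fraction <= 1.0:
        raise ValueError("fill_fraction must be between 0 and 1")

    # Share of each stripe that holds data rather than parity.
    data_share = (n_disks - parity_disks) / n_disks

    # Only data chunks in the used part of the array are ever read.
    return 1.0 - fill_fraction * data_share
```

For a 10-disk RAID6 at 95% full, unread_fraction(10, 2, 0.95) comes out
at roughly 0.24, so about a quarter of the raw blocks would never be
exercised by ordinary reads. A scrub walks all of them.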
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615