[Beowulf] Big storage

Bruce Allen ballen at gravity.phys.uwm.edu
Fri Sep 14 06:04:04 PDT 2007

Hi Loic,

>> This thread has been evolving, but I'd like to push it back a bit. 
>> Earlier in the thread you pointed out the CERN study on silent data 
>> corruption:
>> http://fuji.web.cern.ch/fuji/talk/2007/kelemen-2007-C5-Silent_Corruptions.pdf
> Actually, I was not the one who pointed out this study but I can't
> remember who did.

Oops.  Sorry Loic, sorry Leif.  I'm clearly too senile to deal with two 
different four-letter names starting with 'L'.

>> If you are not already doing this, would it be possible for you to run
>> fsprobe(8) on your X4500 boxes to see if there are any silent data
>> corruption issues there?  You have a large enough storage farm to gather
>> meaningful statistics.
> We are not using fsprobe on our X4500.
> There are two reasons:


I still think that the results would be interesting.

In response to the reasons you gave:

[1] I agree that if ZFS + hardware works as it is supposed to, there will 
not be any corruption.  But it would be nice to prove this via experiment.

[2] You can probably force writes to disk by simply writing files too 
large to fit into the memory cache.  Alternatively, modify fsprobe (or ask 
Peter to modify it) so that it calls fsync() after writes rather than 
using direct IO to bypass the device block buffer layer.
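
The fsync() variant could look roughly like the sketch below.  This is 
Python, not fsprobe's actual code, and the names write_and_sync, 
read_digest and probe_once are made up for illustration.  It writes a 
known pattern, calls fsync() so the data is pushed to the device, then 
reads the file back and compares checksums.

```python
import hashlib
import os


def write_and_sync(path, seed, size, chunk=1 << 20):
    """Write `size` bytes of a deterministic pattern, fsync, return its SHA-256."""
    digest = hashlib.sha256()
    # Repeat a 32-byte hash of the seed to build a block of about `chunk` bytes.
    block = hashlib.sha256(seed).digest() * (chunk // 32)
    with open(path, "wb") as f:
        remaining = size
        while remaining > 0:
            piece = block[:min(chunk, remaining)]
            f.write(piece)
            digest.update(piece)
            remaining -= len(piece)
        f.flush()              # empty Python's userspace buffer
        os.fsync(f.fileno())   # ask the kernel to push the data to the device
    return digest.hexdigest()


def read_digest(path, chunk=1 << 20):
    """Read the file back in chunks and return its SHA-256."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for piece in iter(lambda: f.read(chunk), b""):
            digest.update(piece)
    return digest.hexdigest()


def probe_once(directory, size=8 << 20):
    """One write/sync/read-back cycle; True if the checksums match."""
    path = os.path.join(directory, "probe.dat")  # hypothetical file name
    expected = write_and_sync(path, b"probe", size)
    try:
        return expected == read_digest(path)
    finally:
        os.remove(path)
```

Note that fsync() only forces the write out.  The read-back can still be 
served from the page cache, so to exercise the disks on the read side the 
file would have to be larger than RAM, or the cache dropped first.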

In any case by the end of the year I should have at least ten X4500s, and 
can do some testing myself.  But your collection is an order of magnitude 
larger, so you can collect much more useful statistics.  If those 
statistics show no data corruption, then someone like myself with many 
fewer systems can be very confident that no silent corruption is occurring.


