[Beowulf] Big storage

Thu Sep 13 09:45:00 PDT 2007

According to Bruce Allen:
>
> This thread has been evolving, but I'd like to push it back a bit.
> Earlier in the thread you pointed out the CERN study on silent data
> corruption:
>
> http://fuji.web.cern.ch/fuji/talk/2007/kelemen-2007-C5-Silent_Corruptions.pdf
>
Actually, I was not the one who pointed out this study but I can't
remember who did.

> If you are not already doing this, would it be possible for you to run
> fsprobe(8) on your X4500 boxes to see if there are any silent data
> corruption issues there?  You have a large enough storage farm to gather
> meaningful statistics.
>
We are not using fsprobe on our X4500.

There are two reasons:
 . ZFS has built-in error detection (through "zpool scrub") and we are
   (maybe naively) relying on this to detect and correct data corruption
   which would be otherwise silent;
 . because of a ZFS limitation (there are some :-), fsprobe does not
   work reliably with ZFS.

I'll try to be as concise as possible on the last point.

In order to make sure that data are actually written to/read from disk
and not from cache, fsprobe (optionally) uses Direct I/O (buffer
cache bypass).

Since Direct I/O is not supported by ZFS, you can't actually be certain
that you're reading from disk and not from the cache (although you can
get "some" guarantee that you actually write to the disk using "data
synchronous" writes -- aka O_DSYNC or the "fsync()" family of POSIX
functions).
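
To illustrate the "data synchronous" write side only, here is a minimal
sketch in Python. The helper names (write_synchronously, read_back) are
made up for this example, and it is not how fsprobe itself is
implemented. It assumes a platform where os.O_DSYNC (or at least
os.O_SYNC) is defined, and falls back to a plain fsync() otherwise.

```python
import os

def write_synchronously(path, payload):
    """Write payload and ask the OS to push it to stable storage before returning."""
    # O_DSYNC is not defined on every platform: fall back to O_SYNC, then to no flag.
    sync_flag = getattr(os, "O_DSYNC", getattr(os, "O_SYNC", 0))
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | sync_flag, 0o600)
    try:
        view = memoryview(payload)
        while view:
            # os.write may write fewer bytes than requested, so loop until done.
            written = os.write(fd, view)
            view = view[written:]
        # Explicit fsync() as well, in case no synchronous open flag was available.
        os.fsync(fd)
    finally:
        os.close(fd)

def read_back(path):
    """Read the file again; this read may well be served from cache, not from disk."""
    with open(path, "rb") as f:
        return f.read()
```

Note that a successful comparison after read_back() says nothing about
what is on the platters if the read was satisfied from cache, which is
exactly the problem described here.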

Really flushing the cache for ZFS filesystems is intrusive (to say the
least). You need to either:
 . reboot;
 . unmount all ZFS filesystems then unload the ZFS kernel module(s) and
   start over (reload, remount);
 . export the ZFS pool and import it back.
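
As a rough sketch of the third option, the export/import cycle could be
scripted as below. The pool name "tank" and both function names are
placeholders for this example. The export will fail if any filesystem
in the pool is busy, so the sketch only prints the commands unless
dry_run is turned off.

```python
import subprocess

def cache_flush_commands(pool):
    """Return the export/import command pair for one pool, in execution order."""
    return [["zpool", "export", pool], ["zpool", "import", pool]]

def flush_pool_cache(pool, dry_run=True):
    """Print the commands, and run them only when dry_run is False."""
    for command in cache_flush_commands(pool):
        print(" ".join(command))
        if not dry_run:
            # check=True stops at the first failure (e.g. export refused
            # because the pool is in use), so import is not attempted blindly.
            subprocess.run(command, check=True)

if __name__ == "__main__":
    flush_pool_cache("tank")  # dry run: prints the two commands only
```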

So, my point is that if you're not reliably reading from disk, you
can't reliably detect disk errors.

The main point (and one of the intents, I guess) of the initial report
by Peter Kelemen (and his boss) was to give very strong incentives to
the LHC software developers to make sure that data files (and more
generally all software that handles LHC data) include ways to check data
integrity by storing/handling data checksums, hashes, or error detection
and correction information.
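
A minimal sketch of what such an application-level check could look
like, in Python. The choice of SHA-256 and of a ".sha256" file stored
next to the data are assumptions made for the example, not what the LHC
software actually does, and the function names are hypothetical.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_with_checksum(path, payload):
    """Write the data file and record its checksum in a sidecar file."""
    with open(path, "wb") as f:
        f.write(payload)
    with open(path + ".sha256", "w") as f:
        f.write(hashlib.sha256(payload).hexdigest() + "\n")

def verify(path):
    """Return True if the data file still matches its recorded checksum."""
    with open(path + ".sha256") as f:
        expected = f.read().strip()
    return sha256_of_file(path) == expected
```

A checksum like this only detects corruption; correcting it needs a
second copy of the data or real error correction information.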

That's an absolute requirement for reliable long term data storage
since the amount of data planned to be generated by the 4 LHC
experiments is so huge (mind boggling actually).
The data production rate is estimated at almost 1 petabyte per
month (about 10 PB/year).

Regarding statistics, we plan to collect "zpool scrub" results and
SMART statistics on all our X4500s, but this is not in place yet.

Loïc.
-- 
| Loïc Tortay <tortay at cc.in2p3.fr> -     IN2P3 Computing Centre     |