[Beowulf] Big storage

Loic Tortay tortay at cc.in2p3.fr
Sun Sep 16 00:11:39 PDT 2007

[OK, I thought my previous message would be the last, but *this* is
Greg Lindahl !]

According to Greg Lindahl:
> On Sat, Sep 15, 2007 at 10:37:03AM +0200, Loic Tortay wrote:
> > Therefore, we are not running fsprobe on our X4500s since it is
> > actually less useful than "zpool scrub" for detecting corruptions or
> > problems on data.
> .. how does zpool scrub double-check that zpool scrub is working?
How does fsprobe double-check that fsprobe is working ?

People, please go read Peter Kelemen's slides (or watch his
presentation), and see that sometimes he was unable to see the
corruptions reported by fsprobe.

Each and every non-trivial piece of software (and hardware) has bugs
(ZFS certainly does, and most probably fsprobe too).
That's a fact of life just like silent data corruptions (to paraphrase
Peter's slides).

How come it's somehow normal to express scepticism on ZFS but not on
fsprobe ?  I dare say that I am sceptical of both.

> 								    The
> point of extra user-run testing is often to make sure that your vendor
> did not screw up. Of course, you are welcome to not follow advice,
> good or bad.
The point of extra user testing with fsprobe is moot since fsprobe
provides no *cheap and useful* extra user testing *in my environment*.

We already have extra user testing built into most of the applications
(so *that* is essentially free).

The fact that fsprobe doesn't find corruption doesn't mean that
corruption is not happening on the other parts of the system.

To some extent, if the initial burn-in testing does not find such
problems, that is a clue that the burn-in process is insufficient (think
of it like regression testing: "we've seen this class of problems, now
we check for it").

> Several people have commented that fsprobe doesn't check existing files.
> For your system binaries, you can test them using rpm -V.
That is, in my opinion, the point of the initial report by Peter
Kelemen et al. (which eventually became the slides and is now
generating that buzz, here and even on LKML and Slashdot): the
applications have to generate and check data integrity information.

> My new startup is planning on using md5 everywhere to provide an
> end-to-end check.
Again, that is the only sane and useful answer to (silent) data
corruption: include and check data integrity information (end-to-end).
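
To make "include and check" concrete, here is a minimal sketch in
Python of what an application-level check could look like. It is not
what our applications actually do; the function names and the ".md5"
sidecar file are assumptions made up for the illustration, and MD5 is
used only because it was mentioned above (it catches accidental
corruption, it is not a security measure).

```python
import hashlib
import os
import tempfile


def write_with_digest(path, payload):
    """Write payload to path and store its MD5 in a sidecar file.

    The "<path>.md5" sidecar is a hypothetical convention for this sketch.
    """
    digest = hashlib.md5(payload).hexdigest()
    with open(path, "wb") as f:
        f.write(payload)
    with open(path + ".md5", "w") as f:
        f.write(digest + "\n")
    return digest


def verify(path):
    """Re-read the file and compare its MD5 with the stored digest."""
    with open(path + ".md5") as f:
        expected = f.read().strip()
    h = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        p = os.path.join(d, "sample.dat")
        write_with_digest(p, b"some experiment data\n" * 1000)
        print("intact:", verify(p))
        # Simulate a silent corruption by flipping one byte in place.
        with open(p, "r+b") as f:
            f.seek(10)
            byte = f.read(1)
            f.seek(10)
            f.write(bytes([byte[0] ^ 0xFF]))
        print("after corruption:", verify(p))
```

The digest is computed by the producer before the data reaches the
storage stack and checked by the consumer after it comes back, so the
check covers everything in between, whatever the filesystem is.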

In my environment, a significant portion of the data already includes
data integrity information (sometimes as a by-product of data

We have detected otherwise silent data corruption through this several
times in the past (without fsprobe and ZFS).

| Loïc Tortay <tortay at cc.in2p3.fr> -     IN2P3 Computing Centre     |
