cluster frustrations

Wed Jan 16 10:12:02 PST 2002

Jim Phillips wrote:
> When you build a cluster, you are often taking consumer-class hardware and
> driving it much harder than a normal user.  You also have zero error
> tolerance across the entire cluster.  While in theory this should all be
> worked out in testing, cluster users are the only people likely to see
> errors in the real world.  In our case, the problem was that a BIOS
> setting of "optimal" for some PCI bus parameters was leading to occasional
> data corruption between the CPU and the network card.  Since we had nice
> network cards, capable of doing their own checksumming, the errors were
> never caught.  This was never an issue on the old cluster, which used cheap
> "tulip" cards and made the CPU do the checksumming.
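
For readers who have not seen what "made the CPU do the checksumming" involves, here is a minimal Python sketch of the 16-bit ones' complement checksum described in RFC 1071, which is the kind used by IP, TCP and UDP. It is only an illustration, not the code from any particular driver or kernel, and the function name is my own. The point is that when the host computes this over the data in memory, corruption between the network card and the CPU changes the result and gets noticed.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        # Add each big-endian 16-bit word to the running sum.
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        # Fold any carry bits back into the low 16 bits.
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


payload = bytes.fromhex("0001f203f4f5f6f7")  # sample bytes from RFC 1071
corrupted = bytes([payload[0] ^ 0x04]) + payload[1:]  # one flipped bit

print(hex(internet_checksum(payload)))
print(internet_checksum(payload) != internet_checksum(corrupted))
```

A single flipped bit always changes this checksum, so a host-side check of this kind would flag the sort of corruption described above.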

Indeed, most consumer operating systems (e.g. Windows) silently retry disk
and network errors, and at the relatively low disk and network rates of most
desktop applications you would never notice an error rate of, say, 1%: the
few milliseconds added by a retry aren't significant against the
several-second response time the user expects.  I doubt most users would
notice the difference between a web page taking 1 second to paint and 1.01
seconds.  It has to get really, really bad before there is user-noticeable
degradation, probably on the order of 20-30% loss.
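
To put rough numbers on that, here is a small Python sketch. It rests on assumptions of my own: errors are independent, every failed transfer is retried until it succeeds, a retry costs as much as the original attempt, and the whole operation is I/O time. Under those assumptions the expected number of attempts per transfer is 1 / (1 - p) for error rate p. The function names are made up for the example.

```python
def expected_attempts(error_rate):
    """Mean attempts per transfer if each attempt fails independently."""
    if not 0.0 <= error_rate < 1.0:
        raise ValueError("error_rate must be in [0, 1)")
    return 1.0 / (1.0 - error_rate)


def expected_time(base_seconds, error_rate):
    """Expected elapsed time when a retry costs as much as a first try."""
    return base_seconds * expected_attempts(error_rate)


for rate in (0.01, 0.25):
    seconds = expected_time(1.0, rate)
    print(f"{rate:.0%} errors: 1 s of I/O takes about {seconds:.2f} s")
```

With this model a 1% error rate stretches one second of I/O to about 1.01 seconds, while 25% stretches it to about 1.33 seconds, which is roughly where a desktop user might start to notice.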

In the server case, and particularly in computationally intensive cluster
computing, where you are loading the machines to the limit (or at least
trying to) and your users (i.e. system admins) are sensitive to small
variations in performance, that 1% error rate would be quite noticeable,
particularly if it causes cascading problems that amplify it.  (Nobody would
notice if I worked 1% slower at my desk because of a slightly slower network
or disk, since the uncertainty in my work output per unit time is much, much
greater than that.)

This just goes to show that good performance monitoring tools that let you
see the raw error rates are important.
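
As a rough illustration of what "raw error rates" could look like on a Linux node, here is a Python sketch that parses the per-interface counters the kernel exposes in /proc/net/dev and turns them into errors per packet. The function names are hypothetical, and the column positions assume the usual sixteen-counter layout of that file. Note that these counters only cover errors the hardware or driver actually detected, so corruption that slips past the card's own checksum would not appear here.

```python
def parse_net_dev(text):
    """Return {iface: (rx_packets, rx_errs, tx_packets, tx_errs)}."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        # The name may run straight into the first counter ("eth0:123").
        name, sep, rest = line.rpartition(":")
        fields = rest.split()
        if not sep or len(fields) < 16:
            continue  # not a counter line in the assumed layout
        stats[name.strip()] = (
            int(fields[1]),   # receive packets
            int(fields[2]),   # receive errors
            int(fields[9]),   # transmit packets
            int(fields[10]),  # transmit errors
        )
    return stats


def error_rates(stats):
    """Return {iface: (rx_error_rate, tx_error_rate)} as plain fractions."""
    rates = {}
    for name, (rx_pkts, rx_errs, tx_pkts, tx_errs) in stats.items():
        rx_rate = rx_errs / rx_pkts if rx_pkts else 0.0
        tx_rate = tx_errs / tx_pkts if tx_pkts else 0.0
        rates[name] = (rx_rate, tx_rate)
    return rates
```

On a node, something like error_rates(parse_net_dev(open("/proc/net/dev").read())) would give the current ratios. Sampling it periodically and watching the change between samples is more useful than the totals since boot.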