cluster frustrations

Wed Jan 16 08:31:06 PST 2002

Hi,

I've had Scyld running successfully for quite a while, and have even
taught others (http://www.ks.uiuc.edu/Research/namd/tutorial/NCSA2001/).
I know what I'm doing, and have even set up an older non-Scyld cluster,
but I was tearing my hair out for several weeks at the beginning because
of random crashes.  These turned out to be hardware and BIOS related
rather than software-related, although different versions of the software
exhibited the problems to varying degrees.

When you build a cluster, you are often taking consumer-class hardware and
driving it much harder than a normal user would.  You also have zero error
tolerance across the entire cluster.  While in theory this should all be
worked out in testing, cluster users are the only people likely to see
errors in the real world.  In our case, the problem was that a BIOS
setting of "optimal" for some PCI bus parameters was leading to occasional
data corruption between the CPU and the network card.  Since we had nice
network cards, capable of doing their own checksumming, the checksums were
computed on data that had already been corrupted, so the errors were never
caught.  This was never an issue on the old cluster, which used cheap
"tulip" cards and made the CPU do the checksumming.

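To make the checksum point concrete, here is a rough Python sketch of the
send path only.  It is a toy model under stated assumptions, not a
description of any real driver: corrupt_on_bus() is a hypothetical stand-in
for the flaky PCI transfer (it just flips one bit), and the checksum is a
simple 16-bit ones'-complement sum in the style of RFC 1071.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def corrupt_on_bus(data: bytes, bit: int = 0) -> bytes:
    """Hypothetical model of a bad PCI transfer: flip a single bit."""
    damaged = bytearray(data)
    damaged[bit // 8] ^= 1 << (bit % 8)
    return bytes(damaged)


def send(payload: bytes, offload: bool):
    """Return (bytes on the wire, checksum on the wire)."""
    if offload:
        # The card checksums whatever arrived over the bus,
        # so the checksum covers the already-damaged data.
        on_card = corrupt_on_bus(payload)
        return on_card, internet_checksum(on_card)
    # The CPU checksums the good data before it crosses the bus.
    checksum = internet_checksum(payload)
    return corrupt_on_bus(payload), checksum


def receiver_detects_error(wire_data: bytes, checksum: int) -> bool:
    return internet_checksum(wire_data) != checksum


payload = b"simulation timestep data"
cpu_detected = receiver_detects_error(*send(payload, offload=False))
nic_detected = receiver_detects_error(*send(payload, offload=True))
print("CPU checksum catches the error:", cpu_detected)
print("NIC checksum catches the error:", nic_detected)
```

In this model the CPU-computed checksum no longer matches the damaged data,
so the receiver notices, while the card-computed checksum matches the bad
data and the corruption goes through silently.
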
A normal user would drive maybe 100 MB per day across that network card,
probably at 10 Mbit, or 1/10 of its peak capacity.  Almost all of the data
would be incoming, probably web images.  We were driving 100 MB across
every 15 seconds, which is over 5000x more opportunities for error.  Put 32
machines together and you have over 100,000x the error rate that a typical
user would see.  Add in a 10x lower tolerance for program failure and you
could easily say that a cluster user is demanding one million times more
hardware reliability than a normal desktop user.
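
As a quick check of that arithmetic, here is a small Python sketch.  The
function and parameter names are made up for illustration, and the inputs
are just the round figures from the paragraph above (100 MB per day on a
desktop, 100 MB every 15 seconds per node, 32 nodes, 10x lower tolerance).

```python
SECONDS_PER_DAY = 24 * 60 * 60


def reliability_multiplier(desktop_mb_per_day=100,
                           cluster_mb_per_interval=100,
                           interval_seconds=15,
                           nodes=32,
                           tolerance_factor=10):
    """Return (per-node, whole-cluster, overall) ratios vs. a desktop user."""
    # Traffic one cluster node pushes through its card in a day.
    cluster_mb_per_day = (cluster_mb_per_interval
                          * SECONDS_PER_DAY / interval_seconds)
    per_node = cluster_mb_per_day / desktop_mb_per_day
    # Any node failing takes the job down, so exposure scales with node count.
    per_cluster = per_node * nodes
    # Fold in the lower tolerance for program failure.
    overall = per_cluster * tolerance_factor
    return per_node, per_cluster, overall


per_node, per_cluster, overall = reliability_multiplier()
print(f"per node:     {per_node:,.0f}x")
print(f"32 nodes:     {per_cluster:,.0f}x")
print(f"with 10x tol: {overall:,.0f}x")
```

With these inputs the ratios come out to 5,760x per node, 184,320x for the
cluster, and about 1.8 million overall, in line with the rounded figures
above.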

This is why server-class, error-correcting hardware exists.

-Jim