[Beowulf] Curious about ECC vs non-ECC in practice

Fri May 20 10:21:12 PDT 2011

On 5/20/11 9:35 AM, "Douglas Eadline" <deadline at eadline.org> wrote:

>Joe
>
>While this is somewhat anecdotal, it may be helpful.
>
>Not a large-ish cluster, but as you may guess, I wondered
>about this for Limulus
>(http://limulus.basement-supercomputing.com)
>
>I wrote a script (will post it if anyone is interested)
>that runs memtester until you stop it or it finds
>an error. I ran it on several Core2 Duo systems
>with Kingston DDR2-800 PC2-6400 memory.
>
>My experience in running small clusters
>without ECC has been very good. IMO it is also
>a question of the quality of the memory vendor.
>I never had an issue when running tests and
>benchmarks, which I do quite a bit on new
>hardware e.g.

I'm going to guess that it's highly idiosyncratic.  The timing margins on
all the signals between CPU, memory, and peripherals are tight, and they're
temperature dependent and process dependent, so you could have the exact
same design with very similar RAM and one will get errors and the other
won't.  Folks who design PCI bus interfaces for a living earn their pay,
especially if they have to make it work with lots of different mfrs: just
because all the parts meet their databook specs doesn't mean that the
system will play nice together.

Consider that for memory, you have 64 odd data lines and 20 or so address
lines and some strobes that ALL have to switch together.  Take a data
sensitive pattern where a bunch of lines move at the same time and induce
a bit of voltage into an adjacent trace, which then switches a bit slower
or faster than the rest, and you've got the makings of a challenging hunt
for the problem.
PC board trace lengths all have to be carefully matched, loads have to be
carefully matched, etc.  A 66 MHz clock has a period of about 15 ns, but
modern DDR RAMs do bursts of words separated by a few ns.
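
To put rough numbers on those time scales, here is a back-of-envelope
sketch in Python.  The function name is mine, and the 800 MT/s figure is
just the DDR2-800 rating from the quoted message, used as an assumption
about the transfer rate:

```python
def period_ns(freq_mhz: float) -> float:
    """Period in nanoseconds of one cycle (or transfer) at freq_mhz MHz."""
    return 1000.0 / freq_mhz


# Old 66 MHz bus clock versus one data transfer on DDR2-800 (assumed 800 MT/s).
print(f"66 MHz clock: {period_ns(66):.1f} ns per cycle")      # 15.2 ns
print(f"DDR2-800:     {period_ns(800):.2f} ns per transfer")  # 1.25 ns
```

So the window in which every line has to settle is an order of magnitude
shorter than it was on the old bus.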

1 ns is about 10-15 cm of trace length, but it's the loading, terminations,
and other stuff that causes a problem.  Hang a 1 pF capacitor off a 100
ohm line, and there's a tenth of a ns time constant right there.
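
The arithmetic behind that, as a small Python sketch.  The function names
are mine, and the 15 cm/ns propagation speed is an assumed ballpark for
an ordinary PC board, not a measured value:

```python
def rc_time_constant_ns(r_ohm: float, c_pf: float) -> float:
    """RC time constant in ns; ohms times picofarads gives picoseconds."""
    return r_ohm * c_pf * 1e-3


def trace_delay_ns(length_cm: float, cm_per_ns: float = 15.0) -> float:
    """Propagation delay along a trace, assuming cm_per_ns signal speed."""
    return length_cm / cm_per_ns


tau = rc_time_constant_ns(100, 1)  # 1 pF hung on a 100 ohm line
print(f"RC time constant: {tau:.2f} ns")
print(f"Delay of 1.5 cm of trace: {trace_delay_ns(1.5):.2f} ns")
```

Under those assumptions, one stray picofarad costs about as much timing as
a centimeter and a half of mismatched trace.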

You could also have EMI/EMC issues that cause problems. That same ragged
edge timing margin might be fine with 8 tower cases sitting on a shelf,
but not so good with the exact same mobo and memory stacked into 1-2U
cases in a 19" rack.  Power cords and ethernet cables also carry EMI
around. 

In a large cluster these things will all be aggravated: you've got more
machines running, so you increase the error probability right there.
You've got more electrical noise on the power carried between machines.
You've typically got denser packaging.
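
The first of those effects is easy to put a number on.  A minimal sketch in
Python, assuming each machine independently has the same probability of
hitting a memory error during a run; the 0.1% per-node figure is purely
hypothetical, not a measurement:

```python
def p_any_error(p_node: float, n_nodes: int) -> float:
    """Chance that at least one of n_nodes independent machines has an error."""
    return 1.0 - (1.0 - p_node) ** n_nodes


P_NODE = 0.001  # hypothetical per-node error probability for one run

for n in (1, 8, 64, 512):
    print(f"{n:4d} nodes: {p_any_error(P_NODE, n):.3f}")
```

Independence is the optimistic case.  Shared power noise and denser
packaging would couple the machines and push the real number higher.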