[Beowulf] X5500

Shannon V. Davidson svdavidson at charter.net
Fri Apr 3 15:40:44 PDT 2009

Greg Lindahl wrote:
> On Thu, Apr 02, 2009 at 09:05:26PM -0700, Ellis Wilson wrote:
>> Though entertainingly put, it would be an error to say "ECC is a 
>> requirement" for everyone in a "cluster group".  I can think of more 
>> than just a few purposes for clusters that truly do not require the 
>> accuracy guaranteed by ECC RAM.
> The only big cluster I can think of built without ECC was built by a
> guy whose research area was making computations reliable by doing
> additional inexpensive computations to check the answer. While that
> was really clever, the cluster was intended to be a general purpose
> machine, and this answer-checking thing can only be efficiently done
> for a subset of algorithms. Oddly enough, the cluster was subsequently
> upgraded to ECC.
> I have never run into a situation where a cluster would be improved by
> leaving ECC off. I buy ECC for desktops, too, it's a small price to
> pay to avoid engineer downtime.

Amen. There's no substitute for the right answer. Bit errors don't 
magically limit themselves to floating point values. They can affect 
anything stored in memory including pointers, indexes, bitmaps, and 
code. Debugging software on broken hardware is not fun.
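
To make that concrete, here is a minimal Python sketch of what a single flipped bit does to different kinds of data. It only simulates the flip in software; the helper names (flip_bit_in_float, flip_bit_in_int) and the sample values are made up for illustration and are not taken from any real fault-injection tool.

```python
import math
import struct


def flip_bit_in_float(value: float, bit: int) -> float:
    """Return value with one bit of its IEEE 754 binary64 encoding inverted."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def flip_bit_in_int(value: int, bit: int) -> int:
    """Return value with one bit inverted (hypothetical stand-in for an index)."""
    return value ^ (1 << bit)


# Lowest mantissa bit: a tiny numerical error that might go unnoticed.
small_error = flip_bit_in_float(1.0, 0)

# Highest exponent bit: the same value 1.0 turns into infinity.
large_error = flip_bit_in_float(1.0, 62)

# An array index: a valid but wrong element is read, with no error raised.
data = list(range(0, 1000, 10))   # 100 elements: 0, 10, 20, ...
index = 5
wrong_index = flip_bit_in_int(index, 4)
silent_wrong_value = data[wrong_index]

# A higher bit in the same index: now it points far outside the array.
wild_index = flip_bit_in_int(index, 40)
out_of_range = wild_index >= len(data)
```

The floating point cases at least stay inside the arithmetic. The index cases are the ones that turn into wrong results or crashes far away from the original fault, which is what makes them so painful to debug.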

Non-ECC is great for systems where the expectation of getting the right 
answer is not particularly high and the consequence of failure is not 
too bad.


