[Beowulf] ECC

Sun Nov 4 11:23:54 PST 2012

On 11/4/12 9:46 AM, "Vincent Diepeveen" <diep at xs4all.nl> wrote:

>
>On Nov 4, 2012, at 5:53 PM, Lux, Jim (337C) wrote:
>
>>
>>
>> On 11/3/12 6:55 PM, "Robin Whittle" <rw at firstpr.com.au> wrote:
>>> <snip>
>
>[snip]
>
>>>
>>> For serious work, the cluster and its software needs to survive power
>>> outages, failure of individual servers and memory errors, so ECC memory
>>> is a good investment . . . which typically requires more expensive
>>> motherboards and CPUs.
>>
>>
>> Actually, I don't know that I would agree with you about ECC, etc. ECC
>> memory is an attempt to create "perfect memory".  As you scale up, the
>> assumption of "perfect computation" becomes less realistic, so that means
>> your application (or the infrastructure on which the application sits) has
>> to explicitly address failures, because at sufficiently large scale, they
>> are inevitable.
>
>
>More interesting is the ECC discussion.
>
>ECC is simply a requirement IMHO, not a 'luxury thing' as some
>hardware engineers see it.

Depends on your computational model.  Would you rather spend money on ECC
or on more processors?
ECC comes at a cost in speed as well.  There is some non-zero time
required to compute the syndrome bits and do the correction on the read.
Sure, you can pipeline it, but there's some extra latency inevitably
added. 
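
To make the syndrome step concrete, here is a minimal sketch using a toy
Hamming(7,4) code.  This is an illustration only: real ECC DIMMs typically
use a wider SECDED code (64 data bits plus 8 check bits) implemented in
hardware, and the function names below are made up for this example.

```python
# Toy Hamming(7,4) single-error-correcting code (illustrative only).
# Code word positions 1..7; parity bits sit at positions 1, 2 and 4.

def encode(data):
    """Turn 4 data bits into a 7-bit code word (list index 0 = position 1)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def syndrome(word):
    """Recompute the parity checks; a nonzero result is the bad position."""
    s1 = word[0] ^ word[2] ^ word[4] ^ word[6]
    s2 = word[1] ^ word[2] ^ word[5] ^ word[6]
    s4 = word[3] ^ word[4] ^ word[5] ^ word[6]
    return s1 + 2 * s2 + 4 * s4

def correct(word):
    """Return (corrected word, syndrome), assuming at most one flipped bit."""
    fixed = list(word)
    s = syndrome(fixed)
    if s:
        fixed[s - 1] ^= 1  # flip the bit the syndrome points at
    return fixed, s

def decode(word):
    """Extract the 4 data bits from a 7-bit code word."""
    return [word[2], word[4], word[5], word[6]]

if __name__ == "__main__":
    stored = encode([1, 0, 1, 1])
    read_back = list(stored)
    read_back[4] ^= 1  # simulate a single bit flip at position 5
    fixed, s = correct(read_back)
    print("syndrome:", s, "data:", decode(fixed))
```

Every read has to go through the syndrome computation before the data can
be trusted, which is where the extra latency comes from.  Note also that
this toy code silently miscorrects if two bits flip in the same word.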

>
>I know some memory engineers disagree here - for example one of them
>mentioned to me that "putting ECC onto a GPU
>is nonsense as it is a lot of effort and DDR5 already has a built in
>CRC" something like that (if I remember the quote correctly).
>
>But they do not administer servers themselves.

What's good for a server may or may not be optimum for computation.

>
>Also they don't understand the accuracy or better LACK of accuracy in
>checking calculations done by
>some who calculate at big iron. If you calculate at a cluster and get
>after some months a result - reality is simply that
>99% of the researchers isn't as good as the Einstein league
>researchers and 90% simply sucks too much by any standards
>in this sense that they wouldn't see an obvious problem get generated
>by a bitflip here or there. They just would
>happily invent a new theory, as we already have seen too much in
>history.

One hopes that people doing these sorts of computations are smart enough
to figure out how to validate the results. Even with ECC, the error rate
is not zero.

Typically, issues like numerical precision might be more of a
problem, and those would need to be addressed.
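
As a small illustration of the precision point (a sketch only, with made-up
variable names; nothing here is specific to any particular cluster code),
rounding error shows up in a plain floating point sum with no bit flips at
all:

```python
import math

values = [0.1] * 10

# Naive left-to-right accumulation in double precision.
naive = 0.0
for v in values:
    naive += v

# math.fsum tracks the lost low-order bits and returns a correctly
# rounded sum.
accurate = math.fsum(values)

print("naive sum == 1.0:   ", naive == 1.0)
print("accurate sum == 1.0:", accurate == 1.0)
```

ECC does nothing for this class of error, so the results still have to be
validated some other way.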

>
>By simply putting in ECC there you avoid in some percent of the cases
>this 'interpreting the results correctly' problem.

Yes, but my contention is that a more generalized approach might prove
better.  It's not something like "ECC is universally the best
solution".