[Beowulf] Not quite Walmart, or, living without ECC?

Fri Nov 16 14:19:44 PST 2007

At 09:43 AM 11/16/2007, Peter St. John wrote:
>David,
>I just asked the local NT goon, "do you use ECC for the servers?" and
>he answered, "you have to". What he considers a server-class mobo
>requires ECC and he added that the tendency is now to FB-DIMM (fully
>buffered, http://en.wikipedia.org/wiki/FBDIMM). This suggests to me
>that next year(s) commodity mobos will be ECC.
>
>Of course the additional expense keeps your question interesting for
>now. I would imagine that if something is done to cover **software**
>errors, which are aeternal :-), such as periodic checkpointing, then
>adding memcheck stuff as Tony suggests seems reasonable.

The cluster environment adds some interesting aspects to the problem. 
If you have only one computer, then ECC (or, even more robust, 
something like triple modular redundancy, TMR) is an easy way to go, 
especially because there's no software development overhead. FWIW, we 
use the term EDAC (Error Detection and Correction) to refer to the 
whole thing.  An ECC (Error Correcting Code) is just the particular 
coding used in an overall EDAC approach. Common ECCs are the Hamming 
codes, with 4 check bits for 8 data bits, etc.
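
As a rough illustration of what a Hamming code does, here is a minimal
Python sketch of a single-error-correcting encoder and decoder. The
function names are made up for this example, and real ECC memory does
this in hardware (usually with an extra overall parity bit for
double-error detection, which is left out here).

```python
def hamming_encode(data_bits):
    """Encode a list of 0/1 data bits into a Hamming codeword (list of 0/1).

    Positions are numbered from 1. Power-of-two positions hold check bits,
    every other position holds a data bit.
    """
    m = len(data_bits)
    r = 0
    while 2 ** r < m + r + 1:      # smallest r that can address every position
        r += 1
    n = m + r
    code = [0] * (n + 1)           # index 0 is unused
    it = iter(data_bits)
    for pos in range(1, n + 1):
        if (pos & (pos - 1)) != 0:  # not a power of two: data position
            code[pos] = next(it)
    for i in range(r):
        p = 1 << i
        parity = 0
        for pos in range(1, n + 1):
            if (pos & p) and pos != p:
                parity ^= code[pos]
        code[p] = parity           # makes each parity group even
    return code[1:]


def hamming_decode(codeword):
    """Return (data_bits, error_position); position 0 means no error seen."""
    code = [0] + list(codeword)
    n = len(codeword)
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos        # XOR of the positions of all set bits
    if 0 < syndrome <= n:
        code[syndrome] ^= 1        # a single flipped bit sits at `syndrome`
    data = [code[pos] for pos in range(1, n + 1) if (pos & (pos - 1)) != 0]
    return data, syndrome
```

With 8 data bits this produces a 12-bit codeword, and flipping any one
of the 12 bits is corrected on decode. Two flipped bits would be
miscorrected, which is why hardware normally adds the extra parity bit.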

However, say you had a billion work packets to do, and you're 
processing them on 1000 machines. If the work packet has some 
mechanism for self check, maybe a strategy is just to redo the work 
when the check fails.  If you have a rate constraint, then you can 
add extra processors to keep the work rate up.  This assumes that 
you have a trade between cheap, error-prone computers and expensive, 
error-free ones.
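
A minimal Python sketch of the redo-on-failed-check idea follows. The
work here is a sort whose result can be checked cheaply, and the
"error-prone machine" is simulated. All the names (run_with_redo,
make_flaky_sort, check_sorted, machines_needed) are hypothetical, and
the sizing formula assumes failures are independent and that a failed
packet costs one full extra attempt.

```python
import math


def run_with_redo(packet, compute, check, max_tries=5):
    """Call compute(packet) until check(packet, result) passes.

    Returns (result, number_of_tries).
    """
    for tries in range(1, max_tries + 1):
        result = compute(packet)
        if check(packet, result):
            return result, tries
    raise RuntimeError("packet failed its self-check %d times" % max_tries)


def make_flaky_sort(bad_calls):
    """Simulated error-prone machine: corrupts its first `bad_calls` results."""
    state = {"calls": 0}

    def compute(packet):
        state["calls"] += 1
        out = sorted(packet)
        if state["calls"] <= bad_calls:
            out[0] ^= 1 << 20      # pretend one bit got flipped
        return out

    return compute


def check_sorted(packet, result):
    """Cheap self-check: output is ordered and sums to the same total."""
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return ordered and sum(result) == sum(packet)


def machines_needed(base_machines, failure_rate):
    """Machines needed to hold the work rate when some packets are redone.

    Each packet takes 1 / (1 - failure_rate) attempts on average.
    """
    return math.ceil(base_machines / (1.0 - failure_rate))
```

For example, if 1% of packets fail their check, machines_needed(1000,
0.01) says roughly 1011 cheap machines hold the rate that 1000
error-free ones would.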

In some applications where there's a hard real-time constraint, the 
option of a 'do-over' doesn't exist, so you're forced to a fine-grained 
redundancy (EDAC or TMR).

The same applies if your work quanta are not amenable to a do-over 
(say, all 1000 processors have to participate in lockstep in the next 
time step, so having one die means all the rest wait until it's done).

Then, as you get into ECC, there are a whole lot of other issues. For 
instance, you can do "software ECC" (this is popular on 
spacecraft): store critical values 3 times in different locations, 
and then, before using them, do the compare and vote.  This works if 
your upset rate is low (i.e. you're not worried about a hit in the 
CPU, but in something that is resident in memory for a long time) AND 
if the access rate to that critical information is low.
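
The store-three-times-and-vote scheme can be sketched in a few lines of
Python. This is only an illustration with a made-up class name: in
Python the three copies are not really in separate physical locations,
and flight software would place them in distinct memory regions.

```python
class VotedValue:
    """Keep three copies of an integer and majority-vote on every read."""

    def __init__(self, value):
        self.copies = [value, value, value]

    def read(self):
        a, b, c = self.copies
        # Bitwise 2-of-3 majority: each result bit is whatever at least
        # two of the three copies agree on.
        voted = (a & b) | (a & c) | (b & c)
        # Write the voted value back so upsets don't accumulate.
        self.copies = [voted, voted, voted]
        return voted
```

Because the vote is bitwise, it survives hits on two different copies as
long as they land on different bit positions. It fails only when the
same bit is upset in two copies between reads, which is why a low upset
rate matters.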

What these strategies attempt to do is spend more resources on bits 
with more value. (For instance, if you're transmitting digitized 
voice or music, a hit on the MSB is more audible than one on the LSB, 
so you might be able to choose an ECC that protects those bits better, 
giving up some protection on the others. Maybe you have 16 data bits, 
and you use the code only on the top 8, so you've got 20 total bits to 
transmit 16 (8 protected bits, 4 check bits, 8 unprotected bits), 
rather than 21 for 16 in a conventional Hamming code; add one to each 
if you include an overall parity bit for double-error detection.) 
With compressed video this gets very interesting. Errors in the 
coarse resolution blocks are MUCH more visible than errors in the 
little blocks.
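
The bit counts for this kind of trade follow from the usual Hamming
condition: r check bits can cover m data bits when 2^r >= m + r + 1. A
small Python sketch (function names are made up for illustration)
counts the transmitted bits when only part of a word is protected:

```python
def hamming_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def total_bits(protected, unprotected=0, extra_parity=False):
    """Bits on the wire: protected data, its check bits, then raw bits.

    extra_parity adds the overall parity bit used for double-error detection.
    """
    parity = 1 if extra_parity else 0
    return protected + hamming_check_bits(protected) + parity + unprotected
```

By this count, protecting all 16 bits costs total_bits(16) = 21 bits (22
with the extra parity bit), while protecting only the top 8 costs
total_bits(8, 8) = 20 bits (21 with it). The saving per word is small,
so this only pays off when every bit on the channel counts.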

OTOH, it rapidly gets to where it's easier just to EDAC 
everything.  If you're building ASICs or trying to squeeze every last 
bit per second out of your channel, it's worth it to tailor 
things.  If you're writing software, and the labor for that is the 
big cost contributor, spending money on blanket EDAC is a no-brainer.

Jim Lux