[Beowulf] Not quite Walmart, or, living without ECC?
Jim Lux
James.P.Lux at jpl.nasa.gov
Fri Nov 16 14:19:44 PST 2007
At 09:43 AM 11/16/2007, Peter St. John wrote:
>David,
>I just asked the local NT goon, "do you use ECC for the servers?" and
>he answered, "you have to". What he considers a server-class mobo
>requires ECC and he added that the tendency is now to FB-DIMM (fully
>buffered, http://en.wikipedia.org/wiki/FBDIMM). This suggests to me
>that next year(s) commodity mobos will be ECC.
>
>Of course the additional expense keeps your question interesting for
>now. I would imagine that if something is done to cover **software**
>errors, which are aeternal :-), such as periodic checkpointing, then
>adding memcheck stuff as Tony suggests seems reasonable.
The cluster environment adds some interesting aspects to the problem.
If you have only one computer, then ECC (or, even more robust,
something like triple modular redundancy, TMR) is an easy way to go,
especially because there's no software development overhead. FWIW, we
use the term EDAC, Error Detection and Correction, to refer to the
whole thing. An ECC (Error Correcting Code) is just the particular
coding used in an overall EDAC approach. Common ECCs are the Hamming
codes, with 4 check bits for 8 data bits (5 if you add double-error
detection), etc.
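
To make the check-bit arithmetic concrete, here is a minimal sketch of a single-error-correcting Hamming code in Python. The function names (check_bits_needed, hamming_encode, hamming_correct) are made up for illustration, and the sketch leaves out the extra overall parity bit that memory EDAC usually adds for double-error detection. The count comes from needing 2^r >= m + r + 1 for m data bits, which gives 3 check bits for 4 data bits, 4 for 8, and 5 for 16.

```python
def check_bits_needed(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction only)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def hamming_encode(data: list[int]) -> list[int]:
    """Place data bits at non-power-of-two positions, parity bits at 1, 2, 4, ..."""
    r = check_bits_needed(len(data))
    n = len(data) + r
    code = [0] * (n + 1)  # index 0 unused so that list index == bit position
    bits = iter(data)
    for pos in range(1, n + 1):
        if pos & (pos - 1):  # nonzero means pos is not a power of two
            code[pos] = next(bits)
    for i in range(r):
        p = 1 << i
        parity = 0
        for pos in range(1, n + 1):
            if pos & p:  # parity bit p covers every position with bit p set
                parity ^= code[pos]
        code[p] = parity
    return code[1:]


def hamming_correct(code: list[int]) -> tuple[list[int], int]:
    """Return (data bits, syndrome). A nonzero syndrome is the position that was fixed."""
    syndrome = 0
    for pos, bit in enumerate(code, start=1):
        if bit:
            syndrome ^= pos
    fixed = list(code)
    if 0 < syndrome <= len(fixed):
        fixed[syndrome - 1] ^= 1  # flip the single bad bit back
    data = [b for pos, b in enumerate(fixed, start=1) if pos & (pos - 1)]
    return data, syndrome
```

With 8 data bits this yields a 12-bit codeword, and flipping any one of the 12 bits gives a syndrome equal to the (1-based) position of the flipped bit.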
However, say you had a billion work packets to do, and you're
processing them on 1000 machines. If the work packet has some
mechanism for self check, maybe a strategy is just to redo the work
when the check fails. If you have a rate constraint, then you can
add extra processors to keep the work rate up. This assumes that
you have a trade between cheap, error-prone computers and expensive,
error-free ones.
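
A rough sketch of the redo-on-failure idea, in Python. Everything here is a stand-in chosen for illustration: the "work" is an integer square root, the error-prone node is simulated by occasionally flipping one bit of the result, and the self check is the cheap test root^2 <= n < (root+1)^2. The last function assumes upsets are independent with probability p per attempt, so each packet takes 1/(1-p) attempts on average and the machine count scales by the same factor.

```python
import math
import random


def flaky_isqrt(n: int, rng: random.Random, upset_prob: float) -> int:
    """Simulated cheap node: computes isqrt(n) but sometimes flips one result bit."""
    result = math.isqrt(n)
    if rng.random() < upset_prob:
        result ^= 1 << rng.randrange(8)
    return result


def self_check(n: int, root: int) -> bool:
    """Cheap verification that is much less work than redoing the computation."""
    return root * root <= n < (root + 1) * (root + 1)


def process_with_redo(n: int, rng: random.Random, upset_prob: float,
                      max_tries: int = 50) -> tuple[int, int]:
    """Redo the work packet until its self check passes; return (result, attempts)."""
    for attempt in range(1, max_tries + 1):
        root = flaky_isqrt(n, rng, upset_prob)
        if self_check(n, root):
            return root, attempt
    raise RuntimeError(f"packet {n} failed its self check {max_tries} times")


def machines_needed(base_machines: int, upset_prob: float) -> int:
    """Machines required to hold the same work rate when a fraction of runs is redone."""
    return math.ceil(base_machines / (1.0 - upset_prob))
```

For example, if 1% of packets fail their check, machines_needed(1000, 0.01) says 1011 cheap machines hold the rate of 1000 error-free ones, which is the number to weigh against the price difference.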
In some applications where there's a hard real-time constraint, the
option of 'do-over' doesn't exist, so you're forced to fine-grained
redundancy (EDAC or TMR).
Likewise, if your work quanta are not amenable to a do-over (say,
all 1000 processors have to participate in lockstep in the next time
step, so having one die means they all wait until its work is redone),
you're back to fine-grained redundancy.
Then, as you get into ECC, there's a whole lot of other issues... for
instance, you can do "software ecc" (this is popular on
spacecraft)... store critical values 3 times in different locations,
and then, before using them, do the compare and vote. This works if
your upset rate is low (i.e. you're not worried about a hit in the
CPU, but in something that is resident in memory for a long time) AND
if the access rate to that critical information is low.
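
The store-three-times-and-vote scheme can be sketched as below. This is only an illustration in Python, with a hypothetical TripleStored class: Python gives no control over where the copies physically live, so the "different locations" part is not modeled, and a flight implementation would place the copies in separate memory regions. The vote is a bitwise majority, so it also recovers from upsets that hit different bits in different copies.

```python
def vote(a: int, b: int, c: int) -> int:
    """Bitwise majority: each output bit takes the value held by at least two copies."""
    return (a & b) | (a & c) | (b & c)


class TripleStored:
    """Keep three copies of a critical integer and vote on every read."""

    def __init__(self, value: int) -> None:
        self._copies = [value, value, value]

    def write(self, value: int) -> None:
        self._copies = [value, value, value]

    def read(self) -> int:
        value = vote(*self._copies)
        # Scrub: rewrite all three copies so a single upset does not linger
        # and later combine with a second one in the same bit.
        self._copies = [value, value, value]
        return value
```

Every read costs three loads, a vote, and three stores, which is why this only pays off when the critical value is read rarely.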
What these strategies attempt to do is spend more resources on bits
with more value. (For instance, if you're transmitting digitized
voice or music, a hit on the MSB is more audible than the LSB, so you
might be able to choose an ECC that protects those bits better, giving
up some protection on the others.. Maybe you have 16 data bits, and
you use the code only on the top 8, so you've got 21 total bits to
transmit 16, rather than 22 for 16 in a conventional Hamming
code.) With compressed video this gets very interesting.. Errors in
the coarse resolution blocks are MUCH more visible than errors in the
little blocks.
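
The bit budget for protecting only the most significant bits can be worked out with a few lines of Python. This sketch assumes a Hamming code plus one overall parity bit (single-error correction, double-error detection), which is the accounting that gives 22 bits for a fully protected 16-bit word. The function names are hypothetical.

```python
def secded_check_bits(data_bits: int) -> int:
    """Hamming check bits (2**r >= m + r + 1) plus one overall parity bit."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1


def total_bits(word_bits: int, protected_bits: int) -> int:
    """Bits on the channel when only the top protected_bits of each word are coded."""
    return word_bits + secded_check_bits(protected_bits)


def worst_unprotected_error(word_bits: int, protected_bits: int) -> int:
    """Largest change in the sample value from a single hit on an unprotected bit."""
    unprotected = word_bits - protected_bits
    return 0 if unprotected == 0 else 1 << (unprotected - 1)


for k in (16, 8, 4):
    print(k, total_bits(16, k), worst_unprotected_error(16, k))
```

Under these assumptions a 16-bit sample costs 22 bits fully protected, 21 with the top 8 protected, and 20 with the top 4 protected, while the worst uncorrected single-bit hit grows from 0 to 128 to 2048 counts out of 65536.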
OTOH.. it rapidly gets to where it's easier just to EDAC
everything. If you're building ASICs or trying to squeeze every last
bit per second out of your channel, it's worth it to tailor
things. If you're writing software, and the labor for that is the
big cost contributor, spending money on blanket EDAC is a no-brainer.
Jim Lux