[Beowulf] Not quite Walmart, or, living without ECC?
Jim Lux James.P.Lux at jpl.nasa.gov
Fri Nov 16 14:19:44 PST 2007
- Previous message: [Beowulf] Not quite Walmart, or, living without ECC?
- Next message: [Beowulf] Opteron
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
At 09:43 AM 11/16/2007, Peter St. John wrote:
>David,
>I just asked the local NT goon, "do you use ECC for the servers?" and
>he answered, "you have to". What he considers a server-class mobo
>requires ECC and he added that the tendency is now to FB-DIMM (fully
>buffered, http://en.wikipedia.org/wiki/FBDIMM). This suggests to me
>that next year(s) commodity mobos will be ECC.
>
>Of course the additional expense keeps your question interesting for
>now. I would imagine that if something is done to cover **software**
>errors, which are aeternal :-), such as periodic checkpointing, then
>adding memcheck stuff as Tony suggests seems reasonable.

The cluster environment adds some interesting aspects to the problem. If you have only one computer, then ECC (or, even more robust, something like triple modular redundancy, TMR) is an easy way to go, especially because there's no software development overhead.

FWIW, we use the term EDAC (Error Detection and Correction) to refer to the whole thing. An ECC (Error Correcting Code) is just the particular coding used in an overall EDAC approach. Common ECCs are the Hamming codes, with 4 check bits for 8 data bits (5 if you also want double-error detection), etc.

However, say you had a billion work packets to do, and you're processing them on 1000 machines. If the work packet has some mechanism for self check, maybe a strategy is just to redo the work when the check fails. If you have a rate constraint, then you can add extra processors to keep the work rate up. I'm assuming here that you have a trade between cheap, error-prone computers and expensive, error-free ones.

In some applications where there's a hard real-time constraint, the option of a 'do-over' doesn't exist, so you're forced to a fine-grained redundancy (EDAC or TMR). The same goes if your work quanta are not amenable to a do-over (say, all 1000 processors have to participate in lockstep in the next time step, so having one die means all the others wait till it's done).

Then, as you get into ECC, there's a whole lot of other issues...
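The redo-on-failed-check strategy described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the work unit (`process_packet`), the check (`self_check`), and the helper names are all hypothetical, upsets are simulated with a random bit flip, failures are assumed independent, and the check is assumed to catch every bad result.

```python
import math
import random


def process_packet(payload: bytes, upset_prob: float, rng: random.Random) -> bytes:
    """Hypothetical work unit: reverse the payload, sometimes with one flipped bit."""
    result = bytearray(payload[::-1])
    if result and rng.random() < upset_prob:
        # Simulate a memory upset on a cheap, error-prone machine.
        result[rng.randrange(len(result))] ^= 1 << rng.randrange(8)
    return bytes(result)


def self_check(payload: bytes, result: bytes) -> bool:
    """Hypothetical self-check: undoing the work must give back the input."""
    return result[::-1] == payload


def run_with_redo(payload, upset_prob, rng, max_attempts=10):
    """Redo the packet until it passes its check; return (result, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        result = process_packet(payload, upset_prob, rng)
        if self_check(payload, result):
            return result, attempt
    raise RuntimeError("packet failed its self-check on every attempt")


def machines_needed(base_machines: int, failure_prob: float) -> int:
    """Machines needed to hold the original work rate when some packets are redone.

    With independent failures, each packet takes 1 / (1 - failure_prob)
    attempts on average (geometric distribution).
    """
    return math.ceil(base_machines / (1.0 - failure_prob))
```

Under these assumptions, if half of all packets fail their check, `machines_needed(1000, 0.5)` says you need 2000 cheap machines to match the work rate of 1000 error-free ones, which is the trade being described.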
For instance, you can do "software ECC" (this is popular on spacecraft): store critical values 3 times in different locations, and then, before using them, do the compare and vote. This works if your upset rate is low (i.e. you're not worried about a hit in the CPU, but in something that is resident in memory for a long time) AND if the access rate to that critical information is low.

What these strategies attempt to do is spend more resources on bits with more value. (For instance, if you're transmitting digitized voice or music, a hit on the MSB is more audible than one on the LSB, so you might be able to choose an ECC that protects those bits better, giving up some protection on the others. Maybe you have 16 data bits, and you use the code only on the top 8, so you've got 21 total bits (13 coded plus 8 bare) to transmit 16, rather than 22 for 16 in a conventional Hamming code with double-error detection.) With compressed video this gets very interesting: errors in the coarse resolution blocks are MUCH more visible than errors in the little blocks.

OTOH, it rapidly gets to where it's easier just to EDAC everything. If you're building ASICs or trying to squeeze every last bit per second out of your channel, it's worth it to tailor things. If you're writing software, and the labor for that is the big cost contributor, spending money on blanket EDAC is a no-brainer.

Jim Lux
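As a rough illustration of the store-three-times-and-vote idea mentioned above, here is a minimal sketch. The class name `VotedValue` is hypothetical, and the three copies sit in an ordinary list purely as a stand-in; on real hardware they would be placed in physically separate memory locations. The vote here is a bitwise majority, which is one common way to do the compare, not necessarily what any particular spacecraft uses.

```python
class VotedValue:
    """Keeps three copies of an integer and majority-votes on every read."""

    def __init__(self, value: int) -> None:
        # Assumption: a list stands in for three separate memory locations.
        self.copies = [value, value, value]

    def read(self) -> int:
        a, b, c = self.copies
        # Bitwise majority: each result bit takes the value held by at
        # least two of the three copies.
        voted = (a & b) | (a & c) | (b & c)
        # Scrub: rewrite all copies so a single upset does not linger
        # until a second hit lands on the same bit.
        self.copies = [voted, voted, voted]
        return voted

    def write(self, value: int) -> None:
        self.copies = [value, value, value]
```

The vote recovers the stored value as long as no bit position has been hit in two copies between reads. Every read costs three loads plus the vote, so it only pays off when the critical value is accessed rarely.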
