[Beowulf] Re: cheap PCs this christmas

David Mathog mathog at mendel.bio.caltech.edu
Mon Nov 14 08:46:51 PST 2005

Tony Travis <ajt at rri.sari.ac.uk> wrote

> That sounds great BUT what about the reliability of COTS memory?
> The economics of a DIY Beowulf seem a lot less attractive if
> you have to use PCs with ECC memory.

If you expect the memory to stay error free for any length of time it
must be resistant to various memory failures, including bit flips caused
by cosmic rays and other background radiation.  ECC gives you that;
regular memory does not.  There is a reason servers use ECC memory!

That said, most consumer grade PCs are still useful with their not so
great memory because they are rebooted periodically, typically at least
once a day, and that clears out the accumulated memory errors.  If random
errors occur at a rate "Erate" (errors per unit time per machine, assuming
all machines are identical), you could keep the expected number of errors
per machine below MaxE by rebooting the entire cluster at intervals of
MaxE/Erate.  However, OpenMosix is going to make a hash of that simple
model since, as you described, it replicates memory errors across the
nodes.  Another option is to run each job in duplicate and, if a
discrepancy is found, run a third instance to break the tie.
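
To make the two ideas above concrete, here is a rough Python sketch.  The
function names (reboot_interval, run_with_tiebreak) and the numbers in the
usage note are made up for illustration.  It assumes errors arrive at a
constant, independent rate, and that a job is deterministic and returns a
result that can be compared with ==.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def reboot_interval(max_errors: float, error_rate: float) -> float:
    """Reboot period that keeps the expected errors per machine at max_errors.

    error_rate is "Erate" (errors per unit time per machine); the result
    is in the same time unit.  Hypothetical helper, not from any library.
    """
    if error_rate <= 0:
        raise ValueError("error_rate must be positive")
    return max_errors / error_rate


def run_with_tiebreak(job: Callable[[], T]) -> T:
    """Run job twice; if the results differ, run it a third time.

    Returns the result that at least two runs agree on, and raises if
    all three runs disagree.
    """
    first = job()
    second = job()
    if first == second:
        return first
    third = job()  # tie-breaker run
    if third == first or third == second:
        return third
    raise RuntimeError("all three runs produced different results")
```

For example, with an assumed Erate of 0.25 errors per day per machine and
MaxE of 0.5, reboot_interval(0.5, 0.25) gives a reboot every 2.0 days.
Note that the duplicate-run approach costs at least twice the compute, and
it only helps if the two runs do not share the same corrupted memory.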


David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

