[Beowulf] Re: cheap PCs this christmas

Mon Nov 14 10:32:07 PST 2005

David Mathog wrote:
 > Tony Travis <ajt at rri.sari.ac.uk> wrote
 >
 >> That sounds great BUT what about the reliability of COTS memory?
 > <SNIP>
 >> The economics of a DIY Beowulf seem a lot less attractive if you
 >> have to use PCs with ECC memory.
 >
 > If you expect the memory to stay error free for any length of time it
 > must be resistant to various memory failures, including those caused by
 > gamma rays and other background radiation. ECC gives you that, regular
 > memory does not.  There is a reason servers use ECC memory!

Hello, David.

Yes, I know why servers use ECC memory(!) but there are quite a lot of 
people (me included) who have built Beowulf clusters out of the sort of 
COTS hardware used in the $160 HP PC "Christmas Special" mentioned at 
the start of this thread - this type of PC is very unlikely to use ECC 
memory...

 > That said, most consumer grade PCs are still useful with their not so
 > great memory because they are reset periodically, typically at least
 > once a day, and that clears out the memory errors.  If random errors
 > occur at "Erate" (errors per unit time per machine, all identical
 > machines) you could keep the total number of errors per machine, on
 > average, below MaxE by rebooting the entire cluster at time MaxE/Erate.
 >  However, OpenMosix is going to make a hash of that simple model since,
 > as you described, it replicates memory errors across the nodes.  Other
 > options include running jobs in duplicate and if a discrepancy is found,
 > running a third instance to break the tie.
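
To make the reboot arithmetic above concrete, here is a minimal sketch
in Python. The function name and the numbers are invented purely for
illustration; they are not measurements from any real cluster.

```python
def reboot_interval(max_errors, error_rate):
    """Return the time between reboots, MaxE / Erate.

    max_errors: tolerated average number of errors per machine (MaxE)
    error_rate: errors per unit time per machine (Erate)
    """
    if error_rate <= 0:
        raise ValueError("error_rate must be positive")
    return max_errors / error_rate


# Hypothetical figures: one error per machine per 1000 hours, and a
# tolerated average of 0.01 accumulated errors per machine.
interval_hours = reboot_interval(max_errors=0.01, error_rate=1 / 1000)
print("reboot every", interval_hours, "hours")
```

The result is in whatever time unit Erate uses. As the quoted text
says, this simple model assumes errors stay on the node where they
occur.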

We are, perhaps, looking at this differently: I'm not really trying to 
do software ECC, I just want to know *if* an error has occurred because 
that is cheaper and simpler to do. In the (good!) old days, we used one 
bit of memory to detect parity errors. Then, at least, you knew you'd 
got to re-run a job even if you couldn't correct the error.
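
For anyone who hasn't met parity memory, the idea can be sketched in a
few lines of Python. This models the scheme in software only (real
parity memory does this in hardware), and the helper names are made up
for the example.

```python
def parity_bit(byte):
    """Even-parity bit: 1 if the byte has an odd number of set bits."""
    return bin(byte & 0xFF).count("1") % 2


def store(byte):
    """Model a 9-bit memory cell: the 8 data bits plus one parity bit."""
    return (byte & 0xFF, parity_bit(byte))


def check(cell):
    """Return True if the stored parity still matches the data bits."""
    value, parity = cell
    return parity_bit(value) == parity


cell = store(0b10110010)
flipped = (cell[0] ^ 0b00000100, cell[1])   # simulate a single-bit error
print(check(cell), check(flipped))
```

A single flipped bit is detected but cannot be located or corrected,
and two flipped bits in the same cell cancel out and go unnoticed.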

The problem with modern non-ECC COTS memory is that it doesn't even have 
parity checking because that slows memory access down and reduces the 
total memory capacity of a chip. I think there is some prospect of using 
software to detect COTS memory errors. This is of interest to projects 
involved in using large amounts of COTS memory instead of hard disks to 
store data in space. I just wonder if anyone can come up with similar 
ways of detecting COTS memory errors down here on the ground...
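
One possible shape for detection in software is to keep a checksum
alongside data that is supposed to stay unchanged and re-verify it
before trusting the data. The sketch below is only an assumption about
how that might look, not a description of what the space projects
actually do; the CheckedBuffer class is hypothetical.

```python
import zlib


class CheckedBuffer:
    """A byte buffer that carries a CRC-32 of its own contents."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.crc = zlib.crc32(self.data)

    def write(self, offset, chunk):
        """Legitimate update: change the data, then refresh the CRC."""
        self.data[offset:offset + len(chunk)] = chunk
        self.crc = zlib.crc32(self.data)

    def verify(self):
        """Return True if the contents still match the stored CRC."""
        return zlib.crc32(self.data) == self.crc


buf = CheckedBuffer(b"x" * 1024)
buf.data[100] ^= 0x01        # simulate a bit flip behind our back
print(buf.verify())
```

This tells you *that* something changed, not where, and it does
nothing for memory holding the kernel, the checking code, or the
checksum itself.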

I think you are probably right that the simplest solution is to run jobs 
twice to confirm the results, and this is one of the strategies proposed 
for 'massively' parallel computing in 'space' too. However, if memory 
errors corrupt the openMosix kernel then you get into BIG trouble!
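
The run-twice strategy, with a third run to break a tie, can be
sketched as below. The function name is hypothetical, and a real
cluster would want the repeat runs on different nodes; calling the job
again in the same process only illustrates the logic.

```python
def run_with_check(job, *args):
    """Run job twice; on disagreement, run a third time as tie-breaker."""
    first = job(*args)
    second = job(*args)
    if first == second:
        return first
    third = job(*args)
    if third == first or third == second:
        return third
    raise RuntimeError("all three runs disagree")


print(run_with_check(sum, [1, 2, 3]))
```

This assumes the job is deterministic, so that any disagreement
between runs points to a fault rather than to the job itself.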

It's not quite as bad as it sounds. Judging by simulations in which we 
run the "memtester" stress test periodically on nodes in our cluster, 
we have machines that have been up for over 60 days and can still 
complete 100 passes on 50% of their memory (typically 512MB) without 
reporting an error. I'm working on the basis that if the stress test 
doesn't give errors then a 'normal' application is unlikely to give 
them either.
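
For readers unfamiliar with this kind of test, the basic idea is to
write known patterns into a block of memory and read them back. The
Python sketch below only illustrates that idea; it is not how
memtester is implemented, and Python gives no control over which
physical pages are exercised. All names here are made up.

```python
def count_mismatches(buf, pattern):
    """Count bytes in buf that differ from the expected pattern byte."""
    return sum(1 for b in buf if b != pattern)


def pattern_test(size_bytes, passes=1, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Fill a buffer with each pattern in turn and count read-back errors."""
    buf = bytearray(size_bytes)
    errors = 0
    for _ in range(passes):
        for p in patterns:
            buf[:] = bytes([p]) * size_bytes   # write the pattern
            errors += count_mismatches(buf, p)  # read it back
    return errors


print(pattern_test(64 * 1024, passes=2))
```

With the real tool, and assuming the usual "memtester <memory> [loops]"
invocation, 100 passes over half of a 512MB node would be something
like "memtester 256M 100".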

Best wishes,

     Tony.
-- 
Dr. A.J.Travis,                     |  mailto:ajt at rri.sari.ac.uk
Rowett Research Institute,          |    http://www.rri.sari.ac.uk/~ajt
Greenburn Road, Bucksburn,          |   phone:+44 (0)1224 712751
Aberdeen AB21 9SB, Scotland, UK.    |     fax:+44 (0)1224 716687