[Beowulf] Re: cheap PCs this christmas
Tony Travis
ajt at rri.sari.ac.uk
Mon Nov 14 10:32:07 PST 2005
David Mathog wrote:
> Tony Travis <ajt at rri.sari.ac.uk> wrote
>
>> That sounds great BUT what about the reliability of COTS memory?
> <SNIP>
>> The economics of DIY Beowulf seem a lot less attractive if you have to
>> use PCs with ECC memory.
>
> If you expect the memory to stay error free for any length of time it
> must be resistant to various memory failures, including those caused by
> gamma rays and other background radiation. ECC gives you that, regular
> memory does not. There is a reason servers use ECC memory!
Hello, David.
Yes, I know why servers use ECC memory(!) but there are quite a lot of
people (me included) who have built Beowulf clusters out of the sort of
COTS hardware used in the $160 HP PC "Christmas Special" mentioned at
the start of this thread - this type of PC is very unlikely to use ECC
memory...
> That said, most consumer grade PCs are still useful with their not so
> great memory because they are reset periodically, typically at least
> once a day, and that clears out the memory errors. If random errors
> occur at "Erate" (errors per unit time per machine, all identical
> machines) you could keep the total number of errors per machine, on
> average, below MaxE by rebooting the entire cluster at time MaxE/Erate.
> However, OpenMosix is going to make a hash of that simple model since,
> as you described, it replicates memory errors across the nodes. Other
> options include running jobs in duplicate and if a discrepancy is found,
> running a third instance to break the tie.
We are, perhaps, looking at this differently: I'm not really trying to
do software ECC, I just want to know *if* an error has occurred because
that is cheaper and simpler to do. In the (good!) old days, we used one
bit of memory to detect parity errors. Then, at least, you knew you'd
got to re-run a job even if you couldn't correct the error.
The problem with modern non-ECC COTS memory is that it doesn't even have
parity checking because that slows memory access down and reduces the
total memory capacity of a chip. I think there is some prospect of using
software to detect COTS memory errors. This is of interest to projects
involved in using large amounts of COTS memory instead of hard disks to
store data in space. I just wonder if anyone can come up with similar
ways of detecting COTS memory errors down here on the ground...
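One way to picture software-only detection (not correction) is to keep a
checksum next to data that should not change and re-check it later. The
following is a minimal sketch under stated assumptions: the names
GuardedBuffer and flip_bit are hypothetical, CRC32 is just one convenient
checksum, and the stored checksum lives in the same unprotected memory as
the data, so this can only flag a mismatch, not say which copy is wrong.

```python
import zlib


class GuardedBuffer:
    """Hypothetical wrapper: read-mostly data plus a CRC32 taken at write time."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)
        # Checksum recorded while the data is assumed to be good.
        self.crc = zlib.crc32(self.data)

    def verify(self) -> bool:
        """Return True if the contents still match the stored checksum."""
        return zlib.crc32(self.data) == self.crc


def flip_bit(buf: bytearray, byte_index: int, bit: int) -> None:
    """Simulate a single-bit upset (for demonstration only)."""
    buf[byte_index] ^= 1 << bit


if __name__ == "__main__":
    guarded = GuardedBuffer(b"results of a long computation" * 1000)
    print("before upset:", guarded.verify())
    flip_bit(guarded.data, 12345, 3)
    print("after upset: ", guarded.verify())
```

This only covers data that is written once and read many times. Memory
that the application updates continuously would need the checksum
refreshed on every write, which is where the cost starts to add up.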
I think you are probably right that the simplest solution is to run jobs
twice to confirm the results, and this is one of the strategies proposed
for 'massively' parallel computing in 'space' too. However, if memory
errors corrupt the openMosix kernel then you get into BIG trouble!
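The "run it twice, and a third time to break the tie" strategy can be
sketched as below. This is an illustration only: run_with_vote is a
hypothetical helper, the job is assumed to be deterministic and to return
a value that can be compared with ==, and nothing here protects the
kernel or the voting code itself from a memory error.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_vote(job: Callable[[], T]) -> T:
    """Run job twice; if the results differ, run a third time as tie-breaker."""
    first = job()
    second = job()
    if first == second:
        return first

    # Discrepancy found: a third run decides which earlier result to trust.
    third = job()
    if third == first:
        return first
    if third == second:
        return second
    raise RuntimeError("all three runs disagree; result cannot be trusted")


if __name__ == "__main__":
    print(run_with_vote(lambda: sum(range(1_000_000))))
```

In the common case this doubles the run time, and it only triples it
when a discrepancy is actually seen.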
It's not quite as bad as it sounds because, on the basis of simulations
running the "memtester" stress test periodically on nodes in our cluster,
we have machines that have been up for over 60 days that are capable of
running 100 passes on 50% of their memory (typically 512MB) without
reporting an error. I'm working on the basis that if the stress test
doesn't give errors then a 'normal' application is unlikely to either.
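For readers unfamiliar with this kind of stress test, the basic idea of
one pass is to write known bit patterns into a block of memory and read
them back. The sketch below shows that idea only; it is not how
memtester itself is implemented. pattern_pass and its default patterns
are hypothetical, and a Python process cannot lock or choose physical
pages, so this is far weaker than a real memory tester.

```python
def pattern_pass(size_bytes: int, patterns=(0x00, 0xFF, 0x55, 0xAA)) -> int:
    """Fill a buffer with each pattern, read it back, return mismatched bytes."""
    buf = bytearray(size_bytes)
    errors = 0
    for pattern in patterns:
        # Write the pattern into every byte of the buffer.
        buf[:] = bytes([pattern]) * size_bytes
        # Read back: any byte that no longer equals the pattern is an error.
        errors += size_bytes - buf.count(pattern)
    return errors


if __name__ == "__main__":
    passes = 3
    total = sum(pattern_pass(16 * 1024 * 1024) for _ in range(passes))
    print(f"{passes} passes, {total} mismatched bytes")
```

The alternating patterns 0x55 and 0xAA are a common choice because they
exercise every bit in both states with neighbouring bits set opposite.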
Best wishes,
Tony.
--
Dr. A.J.Travis, | mailto:ajt at rri.sari.ac.uk
Rowett Research Institute, | http://www.rri.sari.ac.uk/~ajt
Greenburn Road, Bucksburn, | phone:+44 (0)1224 712751
Aberdeen AB21 9SB, Scotland, UK. | fax:+44 (0)1224 716687