[Beowulf] cheap PCs this christmas

Tony Travis ajt at rri.sari.ac.uk
Thu Nov 10 08:04:37 PST 2005


Jim Lux wrote:
 > There's a rumor that HP is going to have a $300 PC this Christmas
 > shopping season (as in starting the day after Thanksgiving) to be sold
 > through mass-market outlets (e.g. Wal-Mart).  Presumably this is a real
 > $300, not a $1000 PC with $700 in "rebates".

Hello, Jim.

That sounds great BUT what about the reliability of COTS memory?

I built a 64-node Athlon XP 2400+/2600+ cluster here to run openMosix, 
and had *terrible* problems with memory reliability on 32 of the nodes 
with slimline cases that I bought for £287 each including 1GB RAM, 40GB 
IDE disk and 3C2000 PCI Gigabit NIC. Although it sounds like a bargain, 
it has taken me a long time to weed out all the 'bad' memory (using 
memtest86 and memtester). A particular problem when using openMosix 
process migration is that bad RAM on one node can spread process memory 
corruption throughout the cluster - I wrote about this on the oM Wiki:

http://howto.x-tend.be/openMosixWiki/index.php/Additions%20to%20the%20FAQ

Now, I don't allow openMosix compute nodes to join the cluster unless 
they can run 100 passes of memtester on 50% of their available RAM 
without a single error. This might seem a bit OTT but it is, in fact, a 
realistic simulation of the way real jobs run on the cluster. We adopted 
this strategy because some jobs were producing odd results despite the 
fact that ALL the nodes passed memtest86 before being allowed to join 
the cluster.
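
In case it is useful to anyone, the admission check is little more than a 
wrapper around memtester. Something along these lines (a rough sketch in 
Python, not the actual script we run here - the "join the cluster" step is 
just a placeholder):

#!/usr/bin/env python
# Rough sketch of the admission test: run memtester over ~50% of the
# node's free RAM for 100 passes, and refuse to join the cluster if it
# reports any errors.  Paths and the "join" step are illustrative only.
import subprocess
import sys

PASSES = 100

def free_ram_mb():
    # Read MemFree (in kB) from /proc/meminfo and convert to MB.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemFree:"):
                return int(line.split()[1]) // 1024
    raise RuntimeError("MemFree not found in /proc/meminfo")

def main():
    test_mb = free_ram_mb() // 2                  # test ~50% of available RAM
    cmd = ["memtester", "%dM" % test_mb, str(PASSES)]
    print("running: " + " ".join(cmd))
    rc = subprocess.call(cmd)                     # memtester exits non-zero on error
    if rc != 0:
        print("memtester reported errors - NOT joining the cluster")
        sys.exit(1)
    # placeholder: start openMosix here, e.g.
    # subprocess.call(["/etc/init.d/openmosix", "start"])
    print("memory test passed - OK to join the cluster")

if __name__ == "__main__":
    main()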

There has been some discussion about the reliability of COTS memory in 
space:

	http://www.crhc.uiuc.edu/FTCS-29/pdfs/rennelsd.pdf

And an infrastructure for handling memory errors in the Linux kernel:

	http://kerneltrap.org/node/5293

I've suggested doing CRC checks on memory transfers during oM process 
migration, but this was received with little enthusiasm by the openMosix 
community. Personally, I think it's a similar problem to doing CRC checks 
on disk transfers, and the performance overhead would be acceptable given 
100Base-T/Gigabit NIC latency. I thought it might be possible to adapt 
Rick Rein's work, but he told me he was doubtful about this:

	http://www.linuxjournal.com/article/4489
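
To make the idea concrete, this is roughly the sort of check I have in mind 
(a toy sketch using Python's zlib.crc32 over a fake "page" - not anything 
taken from the oM migration code):

# Toy illustration of CRC-checking a memory transfer: the sender computes
# a CRC32 over each page and sends it with the data; the receiver
# recomputes the CRC and rejects the page on a mismatch.
import zlib

PAGE_SIZE = 4096

def send_page(page):
    # Pretend "transfer": return (data, checksum) as it would go on the wire.
    return page, zlib.crc32(page) & 0xffffffff

def receive_page(page, checksum):
    # Recompute the CRC on arrival and refuse the page if it doesn't match.
    if (zlib.crc32(page) & 0xffffffff) != checksum:
        raise IOError("CRC mismatch: page corrupted in transit (or by bad RAM)")
    return page

if __name__ == "__main__":
    original = bytes(bytearray(range(256)) * (PAGE_SIZE // 256))
    data, crc = send_page(original)
    receive_page(data, crc)            # passes
    corrupted = bytearray(data)
    corrupted[100] ^= 0x01             # flip one bit, as a memory error would
    try:
        receive_page(bytes(corrupted), crc)
    except IOError as e:
        print(e)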

I think memory reliability is an Achilles heel for openMosix on COTS 
clusters. The economics of DIY Beowulf clusters look a lot less attractive 
if you have to use PCs with ECC memory. My present strategy is to subject 
nodes to random memory stress tests, and replace memory if any errors 
are reported. If a node crashes during normal use it is not allowed to 
re-join the cluster until it has run 100 passes of memtester without 
error. FYI memtester is at:

	http://pyropus.ca/software/memtester/

I'd be interested to hear other people's views on, and experiences of, the 
reliability of COTS (i.e. non-ECC) memory.

Best wishes,

	Tony.
-- 
Dr. A.J.Travis,                     |  mailto:ajt at rri.sari.ac.uk
Rowett Research Institute,          |    http://www.rri.sari.ac.uk/~ajt
Greenburn Road, Bucksburn,          |   phone:+44 (0)1224 712751
Aberdeen AB21 9SB, Scotland, UK.    |     fax:+44 (0)1224 716687


