[Beowulf] cheap PCs this christmas
Tony Travis
ajt at rri.sari.ac.uk
Thu Nov 10 08:04:37 PST 2005
Jim Lux wrote:
> There's a rumor that HP is going to have a $300 PC this Christmas
> shopping season (as in starting the day after thanksgiving) to be sold
> through massmarket outlets (e.g. Wal-Mart). Presumably this is a real
> $300, not a $1000 PC with $700 in "rebates".
Hello, Jim.
That sounds great BUT what about the reliability of COTS memory?
I built a 64-node Athlon XP 2400+/2600+ cluster here to run openMosix,
and had *terrible* problems with memory reliability on 32 of the nodes
with slimline cases that I bought for £287 each including 1GB RAM, 40GB
IDE disk and 3C2000 PCI Gigabit NIC. Although it sounds like a bargain,
it has taken me a long time to weed out all the 'bad' memory (using
memtest86 and memtester). A particular problem when using openMosix
process migration is that bad RAM on one node can spread process memory
corruption throughout the cluster - I wrote about this on the oM Wiki:
http://howto.x-tend.be/openMosixWiki/index.php/Additions%20to%20the%20FAQ
Now, I don't allow openMosix compute nodes to join the cluster unless
they can run 100 passes of memtester on 50% of their available RAM
without a single error. This might seem a bit OTT but it is, in fact, a
realistic simulation of the way real jobs run on the cluster. We adopted
this strategy because some jobs were producing odd results despite the
fact that ALL the nodes passed memtest86 before being allowed to join
the cluster.
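As a rough illustration of the admission test described above, here is a minimal Python sketch that builds a memtester command line for 100 passes over half of a node's RAM. The function names are hypothetical, and reading "available RAM" as the MemTotal field of /proc/meminfo is an assumption; the only memtester behaviour relied on is its documented "memtester <mem>[B|K|M|G] [loops]" usage and a zero exit status when no errors are found.

```python
import subprocess


def half_of_ram_mb(meminfo_text, field="MemTotal"):
    """Return half of the named /proc/meminfo field, in whole megabytes.

    Assumption: 'available RAM' is taken to mean MemTotal here.
    """
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == field:
            kilobytes = int(rest.split()[0])  # meminfo reports values in kB
            return kilobytes // 2 // 1024
    raise ValueError(f"{field} not found in meminfo text")


def memtester_command(meminfo_text, passes=100):
    """Build the argument list: memtester <size>M <loops>."""
    return ["memtester", f"{half_of_ram_mb(meminfo_text)}M", str(passes)]


def node_may_join(meminfo_path="/proc/meminfo", passes=100):
    """Run memtester on this node; True only if it exits with status 0."""
    with open(meminfo_path) as handle:
        command = memtester_command(handle.read(), passes)
    # memtester usually needs root so that it can lock the memory under test.
    return subprocess.run(command).returncode == 0
```

On a 1GB node this works out to something like "memtester 512M 100". A real gate would also want to log which node failed, so the memory can be traced and replaced.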
There has been some discussion about the reliability of COTS memory in
space:
http://www.crhc.uiuc.edu/FTCS-29/pdfs/rennelsd.pdf
And an infrastructure for handling memory errors in the Linux kernel:
http://kerneltrap.org/node/5293
I've suggested doing CRC checks on memory transfers during oM process
migration, but this was received with little enthusiasm by the openMosix
community. I think it's a similar problem to doing CRC checks on disk
transfers, and that the performance overhead would be acceptable given
100Base-T/Gigabit NIC latency. I thought it might be possible to adapt
Rick Rein's work, but he told me he was doubtful about this:
http://www.linuxjournal.com/article/4489
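To make the CRC idea concrete, here is a small Python sketch of per-page checksums on a migrated memory image. It is not openMosix code: the function names and the 4096-byte page size are assumptions for illustration, and zlib.crc32 stands in for whatever checksum a kernel implementation would use.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size, for illustration only


def page_checksums(memory, page_size=PAGE_SIZE):
    """Return one CRC-32 per page of the given bytes object."""
    return [
        zlib.crc32(memory[offset:offset + page_size])
        for offset in range(0, len(memory), page_size)
    ]


def corrupted_pages(received, sent_checksums, page_size=PAGE_SIZE):
    """Return indices of pages whose CRC differs from the sender's."""
    local = page_checksums(received, page_size)
    if len(local) != len(sent_checksums):
        raise ValueError("page count differs between sender and receiver")
    return [
        index
        for index, (ours, theirs) in enumerate(zip(local, sent_checksums))
        if ours != theirs
    ]
```

The sending node would compute page_checksums() before migration and ship the list with the pages; the receiving node would refuse the migration, or re-request pages, if corrupted_pages() returned anything. Note that this only catches corruption that happens after the checksums are taken: a page already damaged by bad RAM on the sending node would checksum as "correct".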
I think memory reliability represents an Achilles heel for openMosix on
COTS clusters. The economics of a DIY Beowulf seem a lot less attractive
if you have to use PCs with ECC memory. My present strategy is to subject
nodes to random memory stress tests, and replace memory if any errors
are reported. If a node crashes during normal use it is not allowed to
re-join the cluster until it has run 100 passes of memtester without
error. FYI memtester is at:
http://pyropus.ca/software/memtester/
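The quarantine policy above can be sketched as a small piece of bookkeeping. This is only an illustration in Python: the Cluster class and its method names are hypothetical, and how test results actually reach it (ssh, a daemon, by hand) is left out.

```python
import random

REQUIRED_PASSES = 100  # clean memtester passes needed to re-join


class Cluster:
    """Tracks which nodes may run jobs and which are quarantined."""

    def __init__(self, nodes):
        self.members = set(nodes)
        self.quarantined = set()

    def pick_for_stress_test(self, count, rng=random):
        """Choose up to `count` current members at random for testing."""
        candidates = sorted(self.members)
        return rng.sample(candidates, min(count, len(candidates)))

    def _quarantine(self, node):
        self.members.discard(node)
        self.quarantined.add(node)

    def report_crash(self, node):
        """A node that crashes in normal use leaves the cluster."""
        self._quarantine(node)

    def report_test(self, node, passes_completed, errors):
        """Record a memtester run: any error quarantines the node, and
        only a full clean run re-admits a quarantined node."""
        if errors > 0:
            self._quarantine(node)  # memory should be replaced
        elif node in self.quarantined and passes_completed >= REQUIRED_PASSES:
            self.quarantined.discard(node)
            self.members.add(node)
```

For example, after report_crash("n2") the node stays out of the cluster until report_test("n2", 100, 0) is recorded.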
I'd be interested to hear other people's views on, and experiences of,
the reliability of COTS (i.e. non-ECC) memory.
Best wishes,
Tony.
--
Dr. A.J.Travis, | mailto:ajt at rri.sari.ac.uk
Rowett Research Institute, | http://www.rri.sari.ac.uk/~ajt
Greenburn Road, Bucksburn, | phone:+44 (0)1224 712751
Aberdeen AB21 9SB, Scotland, UK. | fax:+44 (0)1224 716687