[Beowulf] Not quite Walmart, or, living without ECC?

Tony Travis ajt at rri.sari.ac.uk
Fri Nov 16 08:51:33 PST 2007


David Mathog wrote:
> [...]
> Any of you running clusters without ECC?  Has the lack of error
> correction been a problem?

Hello, David.

Yes, I'm running openMosix on 64 Athlon 2400+/2600+ single-processor
(1p) compute nodes. I posted the following about it on the openMosix
Wiki:

http://howto.krisbuytaert.be/openMosixWiki/index.php/Additions_to_the_FAQ

Q. How reliable is openMosix?

A. An openMosix cluster is only as reliable as its least reliable
node. In particular, memory corruption can be propagated throughout a 
cluster if processes are migrated to and from an unreliable COTS 
(Commodity Off The Shelf) PC without ECC (Error Correction Code) memory. 
If the memory corruption is sufficient to make a migrated process crash, 
the load on the unreliable node then decreases and more processes are 
"attracted" to the node from the rest of the cluster by the openMosix 
load balancing algorithm. Migrated processes that do not crash on the 
node may also be corrupted if they make use of unreliable memory. When 
these processes are migrated away from the unreliable node, the memory 
corruption is propagated back to the rest of the openMosix cluster. For 
this reason, it is essential to test the memory of COTS PCs thoroughly 
BEFORE allowing them to join an openMosix cluster. This can be done 
using a stand-alone utility, e.g. "memtest86" (http://www.memtest86.com/), 
or under Linux with a user-mode utility, e.g. "memtester" 
(http://pyropus.ca/software/memtester/).
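
To illustrate the idea behind such testers, here is a toy write/verify
pattern test in Python. It is only a sketch of the principle and not a
substitute for memtest86 or memtester: Python cannot lock or address
physical memory, so it exercises only whatever pages the interpreter
happens to be given. The names pattern_test, flip_bit and PATTERNS, the
choice of patterns, and the "corrupt" hook are all assumptions made for
the illustration; none of them comes from either tool.

```python
# Toy write/verify pattern test (illustrative only).
# Assumed pattern set: all zeros, all ones, and two alternating-bit bytes.
PATTERNS = (0x00, 0xFF, 0xAA, 0x55)


def pattern_test(size_bytes, patterns=PATTERNS, corrupt=None):
    """Fill a buffer with each pattern, read it back, report mismatches.

    Returns a list of (offset, expected_byte, actual_byte) tuples.
    `corrupt` is a hypothetical hook, called after each write phase,
    used here only to simulate a faulty memory cell.
    """
    buf = bytearray(size_bytes)
    errors = []
    for pattern in patterns:
        expected = bytes([pattern]) * size_bytes
        buf[:] = expected                 # write phase
        if corrupt is not None:
            corrupt(buf)                  # simulated fault, if any
        if buf != expected:               # verify phase (fast path)
            for offset, actual in enumerate(buf):
                if actual != pattern:
                    errors.append((offset, pattern, actual))
    return errors


def flip_bit(buf):
    """Simulate a single flipped bit at offset 10 (hypothetical fault)."""
    buf[10] ^= 0x04


if __name__ == "__main__":
    healthy = pattern_test(1024 * 1024)
    faulty = pattern_test(1024, corrupt=flip_bit)
    print("healthy run mismatches:", len(healthy))
    print("simulated fault mismatches:", faulty)
```

A node would only be considered for the cluster if a real tester, run
for many passes over as much memory as it can lock, reports no
mismatches at all.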

Best wishes,

	Tony.
-- 
Dr. A.J.Travis,                     |  mailto:ajt at rri.sari.ac.uk
Rowett Research Institute,          |    http://www.rri.sari.ac.uk/~ajt
Greenburn Road, Bucksburn,          |   phone:+44 (0)1224 712751
Aberdeen AB21 9SB, Scotland, UK.    |     fax:+44 (0)1224 716687


