[Beowulf] Re: cheap PCs this christmas

Tony Travis ajt at rri.sari.ac.uk
Mon Nov 14 12:07:14 PST 2005


David Mathog wrote:
>> It's not quite as bad as it sounds because, on the basis of simulations 
>> running the "memtester" stress test periodically on nodes in our cluster 
>> we have machines that have been up for over 60 days that are capable of 
>> running 100 passes on 50% of their memory (typically 512MB) without 
>> reporting an error. I'm working on the basis that if the stress test 
>> doesn't give errors then a 'normal' application is unlikely to either.
> 
> There's a slight problem with that argument.  Memtest writes and then
> reads back memory fairly quickly.  It will detect memory errors that
> [...]

Hello, David.

Good point, but I'm not using memtest86; I'm using "memtester":

	http://pyropus.ca/software/memtester/

This is Charles Cazabon's user-mode virtual memory (VM) stress test,
which uses mlock() to lock the memory under test into 'core' while Linux
is running. It's not a stand-alone boot-time/burn-in memory test like
"memtest86". I also test the swap disk separately: "memtester" does not
let the tested memory be swapped out unless it runs in 'degraded' mode
without mlock(), which is NOT recommended. The test takes about 50h to
run on an Athlon XP 2400+ with 1GB RAM (512MB of which is actually
tested).
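
As a rough illustration of the mlock() idea described above, here is a
minimal Python sketch. It is not memtester's code or its test
algorithms: the function name check_locked_buffer, the single fill
pattern and the buffer size are assumptions made up for this example. It
maps an anonymous buffer, tries to lock it with mlock(), writes a
pattern and reads it back. If locking fails (for instance because
RLIMIT_MEMLOCK is too low) it carries on unlocked, loosely analogous to
the 'degraded' mode mentioned above.

```python
import ctypes
import mmap


def check_locked_buffer(size_bytes, pattern=0xAA):
    """Write a byte pattern to a buffer and read it back.

    Returns (locked, ok): whether mlock() succeeded, and whether the
    data read back matched what was written. Illustrative only.
    """
    libc = ctypes.CDLL(None, use_errno=True)   # the C library of this process
    buf = mmap.mmap(-1, size_bytes)            # anonymous, page-aligned mapping
    try:
        view = (ctypes.c_char * size_bytes).from_buffer(buf)
        addr = ctypes.addressof(view)
        size = ctypes.c_size_t(size_bytes)

        # mlock() returns 0 on success; on failure we continue unlocked.
        locked = libc.mlock(ctypes.c_void_p(addr), size) == 0

        expected = bytes([pattern]) * size_bytes
        buf[:] = expected                      # write the pattern
        ok = buf[:] == expected                # read it back and compare

        if locked:
            libc.munlock(ctypes.c_void_p(addr), size)
        del view                               # release the export before close()
        return locked, ok
    finally:
        buf.close()
```

Calling check_locked_buffer(64 * 1024) on a Linux box returns a pair of
booleans; a real stress test would loop over many patterns and a much
larger region, which is what makes it take hours rather than
milliseconds.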

All our nodes have already passed memtest86+ which I use to check for 
memory faults before they are connected to the cluster. The nodes then 
have to run 100 passes of "memtester" without error on 50% of their 
memory (the maximum that can be locked by a user process under Linux) 
before being allowed to accept openMosix migrated processes from the 
other nodes in the cluster. I also periodically run "memtester" along 
with 'normal' jobs, as a confidence test, to ensure the cluster is 
working reliably. Now that all the suspect memory has been 'weeded' out,
the cluster is running quite reliably. The last time I had to reboot the
entire cluster was after a mains power failure to the whole building.
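
The admission rule above (100 error-free passes over 50% of a node's
memory) could be scripted along these lines. This is only a sketch: the
helper names memtester_command and node_passes are hypothetical, and it
assumes memtester is on the PATH and is invoked as "memtester <size>M
<loops>", exiting with status 0 when no errors were found. How a node is
then enabled for openMosix migration is not shown, since the post does
not describe it.

```python
import subprocess


def memtester_command(total_mb, fraction=0.5, passes=100):
    """Build the memtester command line for a node with total_mb of RAM."""
    test_mb = int(total_mb * fraction)         # e.g. 512 for a 1024MB node
    return ["memtester", f"{test_mb}M", str(passes)]


def node_passes(total_mb, run=subprocess.run):
    """Return True if the memtester run reports no errors.

    'run' is injectable so the logic can be exercised without actually
    spending hours in a real memory test.
    """
    result = run(memtester_command(total_mb))
    return result.returncode == 0
```

For a 1GB node, memtester_command(1024) gives ["memtester", "512M",
"100"], matching the figures quoted in this message.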

Best wishes,

	Tony.
-- 
Dr. A.J.Travis,                     |  mailto:ajt at rri.sari.ac.uk
Rowett Research Institute,          |    http://www.rri.sari.ac.uk/~ajt
Greenburn Road, Bucksburn,          |   phone:+44 (0)1224 712751
Aberdeen AB21 9SB, Scotland, UK.    |     fax:+44 (0)1224 716687


