[Beowulf] Re: cheap PCs this christmas
Tony Travis
ajt at rri.sari.ac.uk
Mon Nov 14 12:07:14 PST 2005
David Mathog wrote:
>> It's not quite as bad as it sounds because, on the basis of
>> running the "memtester" stress test periodically on nodes in our cluster
>> we have machines that have been up for over 60 days that are capable of
>> running 100 passes on 50% of their memory (typically 512MB) without
>> reporting an error. I'm working on the basis that if the stress test
>> doesn't give errors then a 'normal' application is unlikely to either.
>
> There's a slight problem with that argument. Memtest writes and then
> reads back memory fairly quickly. It will detect memory errors that
> [...]
Hello, David.
Good point, but I'm not using memtest86, I'm using "memtester":
http://pyropus.ca/software/memtester/
This is Charles Cazabon's user-mode VM stress test, using mlock() to
lock memory into 'core' while Linux is running. It's not a stand-alone
boot-time/burn-in memory test like "memtest86". I also test the swap
disk separately, but "memtester" doesn't allow the tested memory to be
swapped unless it runs in 'degraded' mode without mlock(), which is NOT
recommended. The test takes about 50h to run on an Athlon XP 2400+ with
1GB RAM (512MB of which is actually tested).
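To illustrate the difference between a locked and a 'degraded' run, here is a minimal sketch in Python of a user-mode pattern test. It is not memtester's code or its actual test suite: the function names (try_mlock, pattern_pass, run_test), the buffer size, and the four byte patterns are all assumptions chosen for illustration. It tries to pin an anonymous mapping with mlock() and carries on unlocked if the call fails.

```python
import ctypes
import ctypes.util
import mmap


def try_mlock(buf):
    """Try to pin buf in RAM with mlock(); return False if that is not possible."""
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
        mlock = libc.mlock
    except (OSError, AttributeError, TypeError):
        return False  # no usable libc/mlock here: 'degraded' mode
    view = ctypes.c_char.from_buffer(buf)
    addr = ctypes.addressof(view)
    ok = mlock(ctypes.c_void_p(addr), ctypes.c_size_t(len(buf))) == 0
    del view  # release the buffer export so the mapping can be closed later
    return ok


def pattern_pass(buf, pattern):
    """Fill buf with one byte value, read it back, return the mismatch count."""
    size = len(buf)
    fill = bytes([pattern])
    buf[:] = fill * size
    return size - bytes(buf[:]).count(fill)


def run_test(size=1 << 20, passes=2, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Run write/read-back passes over an anonymous mapping of `size` bytes."""
    buf = mmap.mmap(-1, size)
    locked = try_mlock(buf)  # may fail if RLIMIT_MEMLOCK is too small
    errors = 0
    for _ in range(passes):
        for pattern in patterns:
            errors += pattern_pass(buf, pattern)
    buf.close()
    return locked, errors


if __name__ == "__main__":
    locked, errors = run_test()
    print("locked" if locked else "degraded (not locked)", "errors:", errors)
```

If the lock fails, the kernel is free to page the buffer out to swap during the run, so a clean result says less about the RAM itself.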
All our nodes have already passed memtest86+ which I use to check for
memory faults before they are connected to the cluster. The nodes then
have to run 100 passes of "memtester" without error on 50% of their
memory (the maximum that can be locked by a user process under Linux)
before being allowed to accept openMosix migrated processes from the
other nodes in the cluster. I also periodically run "memtester" along
with 'normal' jobs, as a confidence test, to ensure the cluster is
working reliably. With all the suspect memory 'weeded' out, the cluster
is now running quite reliably. The last time I had to reboot the entire
cluster was after a mains power failure to the whole building.
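As a rough sketch of the admission test described above, the following Python helper builds a memtester command line for half of a node's memory and 100 passes. The helper name memtester_command and the use of MemTotal from /proc/meminfo are assumptions for illustration; the post does not say how the 50% figure is actually computed on these nodes.

```python
def memtester_command(meminfo_text, fraction=0.5, passes=100):
    """Build a memtester argv that tests `fraction` of MemTotal for `passes` loops."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            total_kb = int(line.split()[1])  # /proc/meminfo reports this in kB
            break
    else:
        raise ValueError("MemTotal not found in meminfo text")
    test_mb = int(total_kb * fraction) // 1024
    # memtester takes the amount of memory (here with an M suffix) and a loop count
    return ["memtester", f"{test_mb}M", str(passes)]


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        print(" ".join(memtester_command(f.read())))
```

The resulting list can be handed to subprocess.run(); as far as I know, memtester exits with status 0 only when every pass completes without error, so a node would be admitted only on a zero return code.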
Best wishes,
Tony.
--
Dr. A.J.Travis, | mailto:ajt at rri.sari.ac.uk
Rowett Research Institute, | http://www.rri.sari.ac.uk/~ajt
Greenburn Road, Bucksburn, | phone:+44 (0)1224 712751
Aberdeen AB21 9SB, Scotland, UK. | fax:+44 (0)1224 716687