[Beowulf] Stress / torture test cluster hardware
Andrew Shewmaker
agshew at gmail.com
Sat Oct 7 21:26:52 PDT 2006
On 10/7/06, Nico Mittenzwey <nico.mittenzwey at s2001.tu-chemnitz.de> wrote:
> "memtest86" http://www.memtest86.com/
If you are using a large amount of ECC memory, you may find it
necessary to keep track of Single Bit Errors (SBEs) and look for "weak"
DIMMs using something like the EDAC/bluesmoke drivers
(http://bluesmoke.sourceforge.net) and a userspace memory
tester.
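
As a rough illustration of what "keeping track" can look like, here
is a minimal Python sketch that reads the corrected-error counters
the EDAC drivers export through sysfs. It assumes the counters live
in ce_count files under /sys/devices/system/edac/mc (per memory
controller and per csrow); the exact layout depends on kernel
version and chipset driver, so check your own nodes first. The
function name and the root parameter are made up for this example.

```python
import glob
import os

# Assumed location of the EDAC memory-controller sysfs tree;
# adjust if your kernel or driver exposes it elsewhere.
EDAC_MC_ROOT = "/sys/devices/system/edac/mc"


def read_ce_counts(root=EDAC_MC_ROOT):
    """Return {location: corrected_error_count} for each ce_count file.

    Keys are paths relative to root, e.g. "mc0" for a controller
    total or "mc0/csrow2" for a single chip-select row.
    """
    counts = {}
    patterns = ("mc*/ce_count", "mc*/csrow*/ce_count")
    for pattern in patterns:
        for path in sorted(glob.glob(os.path.join(root, pattern))):
            location = os.path.relpath(os.path.dirname(path), root)
            try:
                with open(path) as f:
                    counts[location] = int(f.read().strip())
            except (OSError, ValueError):
                # Skip unreadable or malformed counter files.
                continue
    return counts


if __name__ == "__main__":
    for location, count in sorted(read_ce_counts().items()):
        print(f"{location}: {count} corrected errors")
```

Run from cron or a loop on each node while the memory tester is
going, and log the output with a timestamp so counters can be
compared over days rather than minutes.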
On a 264-node cluster with 8-16GB RAM, I had to weed out
weak memory over a period of months. A given node
running a memory tester would show no SBEs for a day or
more, then suddenly show a huge burst. Another system
might have a more consistently incrementing SBE counter.
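
To tell those two patterns apart from logged counter values, a small
sketch like the following may help. It takes a list of cumulative
SBE counter samples from one node and labels the series. The
function names, the labels, and the burst_threshold default are all
hypothetical; pick a threshold that matches your sampling interval.

```python
def deltas(samples):
    """Per-interval increases from a list of cumulative counter samples."""
    out = []
    for prev, cur in zip(samples, samples[1:]):
        # A drop means the counter was reset (reboot or driver
        # reload), so count the new value from zero.
        out.append(cur - prev if cur >= prev else cur)
    return out


def classify(samples, burst_threshold=100):
    """Label a counter series as "quiet", "burst", or "incrementing".

    burst_threshold is an arbitrary illustrative value: any single
    interval with at least that many new errors counts as a burst.
    """
    d = deltas(samples)
    if not any(d):
        return "quiet"
    if max(d) >= burst_threshold:
        return "burst"
    return "incrementing"


if __name__ == "__main__":
    print(classify([0, 0, 0, 0]))       # no new errors
    print(classify([0, 0, 0, 5000]))    # long silence, then a spike
    print(classify([0, 3, 7, 10, 14]))  # slow, consistent growth
```

The point of sampling over a long window is that a node in the
first category can look perfectly clean for a day or more.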
Now, ECC was working, so applications like the memory
tester weren't having problems. However, I couldn't reliably
reboot this cluster because the BIOS would often refuse to
boot unless a node was powered off for, say, five minutes.
I wrote up more of this experience on the Real World Tech
forum:
http://www.realworldtech.com/forums/index.cfm?action=detail&id=69894&threadid=69639&roomid=11
It looks like the latest version of Stresslinux has a 2.6.16.18
kernel, so it should have the EDAC drivers included. Plus,
it has the userspace memtester. Memtest86 is nice, but it
didn't support checking the ECC counters on the cluster
I mentioned above, so it couldn't help me weed out DIMMs
at all.
See http://agenda.clustermonkey.net/index.php/Memory
for some more info about this (links to LWN articles and a
list of supported drivers in 2.6.16).
I wasn't aware of the EDAC wiki until I saw it linked
from the bluesmoke page just now. It will tell you
what chipset support is coming.
http://buttersideup.com/edacwiki/
I would be interested to hear what kind of single bit
error rates other people see on their clusters.
--
Andrew Shewmaker