[Beowulf] Stress / torture test cluster hardware


Andrew Shewmaker agshew at gmail.com
Sat Oct 7 21:26:52 PDT 2006


On 10/7/06, Nico Mittenzwey <nico.mittenzwey at s2001.tu-chemnitz.de> wrote:

> "memtest86" http://www.memtest86.com/

If you are using a large amount of ECC memory, you may find it
necessary to keep track of Single Bit Errors and look for "weak"
DIMMs using something like the EDAC/bluesmoke drivers
(http://bluesmoke.sourceforge.net) and a userspace memory
tester.
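As a rough sketch of what "keeping track" could look like: the
snippet below assumes the EDAC driver exposes per-csrow corrected
error counters as ce_count files under /sys/devices/system/edac/mc.
That path, and the helper names read_ce_counts and ce_deltas, are
assumptions for illustration; the layout depends on the kernel
version and driver in use, so the root directory is a parameter.

```python
import glob
import os

# Assumed default location of the EDAC counters; adjust for your kernel.
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_ce_counts(root=EDAC_ROOT):
    """Return {relative path: corrected-error count} for each csrow."""
    counts = {}
    pattern = os.path.join(root, "mc*", "csrow*", "ce_count")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                counts[os.path.relpath(path, root)] = int(f.read().strip())
        except (OSError, ValueError):
            # Skip counters that are unreadable or not a plain integer.
            continue
    return counts


def ce_deltas(before, after):
    """Return only the csrows whose counter grew between two snapshots."""
    return {
        row: after[row] - before.get(row, 0)
        for row in after
        if after[row] > before.get(row, 0)
    }
```

Taking a snapshot with read_ce_counts() before and after a run of
the userspace memory tester, then passing both to ce_deltas(), gives
the csrows that logged new SBEs during that run.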

On a 264-node cluster with 8-16GB RAM, I had to weed out
weak memory over a period of months.  A given node
running a memory tester would show no SBEs for a day or
more, then suddenly show a huge burst.  Another system
might have a more consistently incrementing SBE counter.
Now, ECC was working, so applications like the memory
tester weren't having problems.  However, I couldn't reliably
reboot this cluster because the BIOS would often refuse to
boot unless a node was powered off for, say, five minutes.
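
The two behaviours described above (a long quiet period followed by
a burst, versus a steadily incrementing counter) can be told apart
from a series of periodic counter readings. The following is only a
sketch: the function name classify_sbe_pattern and both thresholds
are hypothetical and would need tuning against real data.

```python
def classify_sbe_pattern(samples, burst_share=0.8, steady_fraction=0.5):
    """Label a series of cumulative SBE counter readings.

    samples is a list of counter values taken at regular intervals.
    Both thresholds are illustrative guesses, not measured values.
    """
    # Per-interval increments; a drop (e.g. counter reset on reboot)
    # is treated as zero rather than a negative error count.
    increments = [max(b - a, 0) for a, b in zip(samples, samples[1:])]
    total = sum(increments)
    if total == 0:
        return "clean"
    active = sum(1 for inc in increments if inc > 0)
    if active / len(increments) >= steady_fraction:
        return "steady"   # errors show up in most intervals
    if max(increments) / total >= burst_share:
        return "burst"    # one interval accounts for most errors
    return "mixed"
```

For example, readings of [0, 0, 0, 0, 500, 500] would be labelled
"burst", while [0, 2, 3, 5, 6, 9] would be labelled "steady".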

I wrote up more of this experience on the Real World Tech
forum:

http://www.realworldtech.com/forums/index.cfm?action=detail&id=69894&threadid=69639&roomid=11

It looks like the latest version of Stresslinux has a 2.6.16.18
kernel, so it should have the EDAC drivers included.  Plus,
it has the userspace memtester.  Memtest86 is nice, but it
didn't support checking the ECC counters on the cluster
I mention above.  It couldn't help me weed out DIMMs at
all.

See http://agenda.clustermonkey.net/index.php/Memory
for some more info about this (links to LWN articles and a
list of supported drivers in 2.6.16).

I wasn't aware of the EDAC wiki until I saw it linked
from the bluesmoke page just now.  It will tell you
what chipset support is coming.

http://buttersideup.com/edacwiki/

I would be interested to hear what kind of single bit
error rates other people see on their clusters.

-- 
Andrew Shewmaker


