[Beowulf] Memory stress testing tools.

Prentice Bisbal prentice at ias.edu
Thu Dec 9 08:08:28 PST 2010

On 12/08/2010 11:47 AM, Jason Clinton wrote:
> On Tue, Dec 7, 2010 at 10:54, Prentice Bisbal <prentice at ias.edu> wrote:
>     Can any of you recommend a good RAM stress testing tool?
> We have an open source ISO/netboot image that can stress-test using the
> latest Linux kernel EDAC facilities and HPL as the test code. It's
> posted here: http://www.advancedclustering.com/software/breakin.html
> It's intended to be booted into.
> There's a beta of a slightly newer version posted at:
> http://lab.advancedclustering.com/bootimage/
> I would be interested in any feedback you have on either version.


I know breakin well. I used it quite a bit in 2008 when I was 
stress-testing my then-new cluster, and sent some feedback to the 
developer at the time (last name Shoemaker, I think).  I did find that I 
could run it for days on all my cluster nodes, and then a few days 
later, when running HPL as a single job across all the nodes, I'd get 
memory errors. I haven't used it since. Not because I don't like it, but 
I just haven't had a need for it since then.

I've also been testing this node by running a single HPL job across all 
32 cores myself, and even after days of doing this, I couldn't trigger 
any errors, but a user program could trigger an error in only a couple 
of hours.

Based on these experiences, I don't think that HPL is good at stressing 
RAM. Has anyone else had similar experiences?

Since this system has 128 GB of RAM, I think it's a good assumption that 
many programs might not use all of that RAM, so I need something memory 
specific that I know will hit all 128 GB of RAM.
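
To illustrate what a "memory specific" test means here, below is a 
minimal sketch in Python of a write-then-verify pattern test. It is not 
how breakin, HPL, or mprime work internally; the function name, the 
pattern set, and the chunk size are all assumptions made for the 
example. A user-space test like this can only touch the memory the OS 
hands to the process, so it cannot reach every byte of physical RAM.

```python
def pattern_test(size_bytes, patterns=(0x00, 0xFF, 0xAA, 0x55), chunk=1 << 20):
    """Fill a buffer with each byte pattern, read it back, and return a
    list of (pattern, offset) pairs for chunks that did not match.

    Hypothetical helper for illustration only.
    """
    buf = bytearray(size_bytes)
    bad = []
    for p in patterns:
        expected = bytes([p]) * chunk
        # Write pass: fill the whole buffer with the current pattern.
        for off in range(0, size_bytes, chunk):
            n = min(chunk, size_bytes - off)
            buf[off:off + n] = expected[:n]
        # Verify pass: read everything back and compare chunk by chunk.
        for off in range(0, size_bytes, chunk):
            n = min(chunk, size_bytes - off)
            if buf[off:off + n] != expected[:n]:
                bad.append((p, off))
    return bad


if __name__ == "__main__":
    # 64 MiB is an arbitrary demo size; pick your own for a real run.
    print(pattern_test(64 * (1 << 20)))
```

The buffer size is the knob that matters: the test only exercises what 
it allocates, which is the point being made above about programs that 
never touch most of the installed RAM.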

So far, mprime appears to be working. I was able to trigger an SBE (single-bit error) in 21 
hours the first time I ran it.  I plan on running it repeatedly for the 
next few days to see how well it can repeat finding errors.
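
For tracking whether repeated runs keep producing errors, one option is 
to poll the kernel's EDAC counters between runs. The sketch below 
assumes the EDAC driver for the platform is loaded and exposes 
per-controller `ce_count` (corrected errors) and `ue_count` 
(uncorrected errors) files under `/sys/devices/system/edac/mc/`. The 
function name and the returned dictionary layout are made up for this 
example.

```python
import glob
import os


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}.

    Hypothetical helper; a value of None means the file was missing or
    could not be parsed.
    """
    counts = {}
    for mc in sorted(glob.glob(os.path.join(base, "mc*"))):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                with open(os.path.join(mc, name)) as f:
                    entry[name] = int(f.read().strip())
            except (OSError, ValueError):
                entry[name] = None
        counts[os.path.basename(mc)] = entry
    return counts


if __name__ == "__main__":
    for mc, entry in read_edac_counts().items():
        print(mc, entry)
```

Recording these counts before and after each stress run gives a simple 
way to compare how quickly different tools provoke corrected errors.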

Prentice Bisbal
