[Beowulf] Memory stress testing tools.
Prentice Bisbal
prentice at ias.edu
Thu Dec 9 08:08:28 PST 2010
On 12/08/2010 11:47 AM, Jason Clinton wrote:
> On Tue, Dec 7, 2010 at 10:54, Prentice Bisbal <prentice at ias.edu> wrote:
>
> Can any of you recommend a good RAM stress testing tool?
>
>
> We have an open source ISO/netboot image that can stress-test using the
> latest Linux kernel EDAC facilities and HPL as the test code. It's
> posted here: http://www.advancedclustering.com/software/breakin.html
>
> It's intended to be booted into.
>
> There's a beta of a slightly newer version posted at:
> http://lab.advancedclustering.com/bootimage/
>
> I would be interested in any feedback you have on either version.
Jason,
I know breakin well. I used it quite a bit in 2008 when I was
stress-testing my then-new cluster, and sent some feedback to the
developer at the time (last name Shoemaker, I think). I did find that I
could run it for days on all my cluster nodes, and then a few days
later, when running HPL as a single job across all the nodes, I'd get
memory errors. I haven't used it since. Not because I don't like it, but
I just haven't had a need for it since then.
I've also been testing this node by running a single HPL job across all
32 cores myself, and even after days of doing this, I couldn't trigger
any errors, but a user program could trigger an error in only a couple
of hours.
Based on these experiences, I don't think that HPL is good at stressing
RAM. Has anyone else had similar experiences?
Since this system has 128 GB of RAM, I think it's a good assumption that
many programs might not use all of that RAM, so I need something memory
specific that I know will hit all 128 GB of RAM.
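To illustrate what a "memory specific" test means here, the following is a minimal sketch of a pattern write-and-verify pass over one large buffer. It is not how breakin, HPL, or mprime work; the function names (fill, verify, pattern_test), the 1 MiB chunk size, and the choice of patterns are all assumptions made for the example. A user-space script like this cannot touch memory held by the kernel or other processes, and CPU caches may satisfy some reads, so it should be read as a description of the idea rather than a replacement for a real tester.

```python
CHUNK = 1 << 20  # 1 MiB working chunk (assumed value, chosen for illustration)


def fill(view, pattern, chunk=CHUNK):
    """Write the single-byte pattern across the whole buffer, chunk by chunk."""
    block = bytes([pattern]) * chunk
    size = len(view)
    for off in range(0, size, chunk):
        n = min(chunk, size - off)
        view[off:off + n] = block[:n]


def verify(view, pattern, chunk=CHUNK):
    """Read the buffer back; return (offset, expected, found) for each bad byte."""
    block = bytes([pattern]) * chunk
    size = len(view)
    errors = []
    for off in range(0, size, chunk):
        n = min(chunk, size - off)
        piece = view[off:off + n]
        if piece != block[:n]:
            # Only walk byte by byte when the fast comparison fails.
            errors.extend(
                (off + i, pattern, b) for i, b in enumerate(piece) if b != pattern
            )
    return errors


def pattern_test(size_bytes, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Allocate size_bytes, then fill and verify it once per pattern."""
    buf = bytearray(size_bytes)
    view = memoryview(buf)
    errors = []
    for p in patterns:
        fill(view, p)
        errors.extend(verify(view, p))
    return errors


if __name__ == "__main__":
    # 64 MiB demo size; testing a 128 GB node would need a buffer sized
    # close to the free memory, and probably one process per NUMA node.
    bad = pattern_test(64 * 1024 * 1024)
    print("mismatched bytes:", len(bad))
```

The alternating patterns 0xAA and 0x55 set every bit to both 0 and 1 across passes, which is why they are commonly paired with all-zeros and all-ones.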
So far, mprime appears to be working. I was able to trigger an SBE
(single-bit error) in 21 hours the first time I ran it. I plan on running
it repeatedly for the next few days to see how reliably it reproduces
errors.
--
Prentice Bisbal