[Beowulf] Memory stress testing tools.

Thu Dec 9 13:51:44 PST 2010

Jason Clinton wrote:
> On Thu, Dec 9, 2010 at 10:08, Prentice Bisbal <prentice at ias.edu> wrote:
> 
>     I know breakin well. I used it quite a bit in 2008 when I was
>     stress-testing my then-new cluster, and sent some feedback to the
>     developer at the time (last name Shoemaker, I think).  I did find that I
>     could run it for days on all my cluster nodes, and then a few days
>     later, when running HPL as a single job across all the nodes, I'd get
>     memory errors. I haven't used it since. Not because I don't like it, but
>     I just haven't had a need for it since then.
> 
> 
> Hum. It's possible that EDAC support for your chipset didn't exist at
> the time. AMD and Intel have been pretty good about landing EDAC for
> their chips in vanilla upstream kernels for the past year, which is
> why it is important to use a recent kernel, or at least one with recent
> backports of that work.

At the time, I was using the latest version of Breakin available. I was
testing on AMD Barcelona processors. I was using Breakin in
September/October 2008, and the Barcelona processors came out in March -
May of that year. I would assume that would be enough time for support
for the new processors to trickle down to Breakin, but that's just an
assumption; I can't confirm or prove it.

> 
> 
>     I've also been testing this node by running a single HPL job across all
>     32 cores myself, and even after days of doing this, I couldn't trigger
>     any errors, but a user program could trigger an error in only a couple
>     of hours.
> 
>     Based on these experiences, I don't think that HPL is good at stressing
>     RAM. Has anyone else had similar experiences?
> 
> 
> HPL is among the most memory intensive workloads out there. This is why
> architectural changes in the past few years that have increased the
> aggregate memory bandwidth of the architecture have resulted in higher
> measured platform efficiency.
> 
> My guess would be that the difference you've seen between the two would
> be statistical noise. How are you measuring errors? MCE events?

I don't think this is statistical noise. This system has consistently
reported SBEs since it was installed several months ago. I've
probably tried to trigger SBEs with HPL dozens of times. I'll often run
it 2-3 times in a row without triggering errors over a period of several
days. When the users go back to using this server, they usually trigger
errors in less time than that. I think HPL has triggered the
error only a couple of times.

The system is a Dell PowerEdge something or other. It has an LCD display
that is normally blue. When a hardware error is detected, it turns orange
and shows the error. I check that several times a day. Our central log
server also e-mails any critical log errors that get sent to it, so
even if I don't check the display on the front of the server, I'll
receive an e-mail shortly after the error is logged in my system logs.

It's low tech, but it works.
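
For comparison, on a kernel that does have EDAC support for the chipset,
the same kind of check can be done by polling the error counters the EDAC
driver exposes in sysfs. The following is only a minimal sketch, assuming
the usual /sys/devices/system/edac/mc/mc*/ce_count and ue_count files are
present; the function names are made up for illustration, and SBEs are
assumed to show up as corrected errors (CE).

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: (ce_count, ue_count)} for each mc* directory.

    An empty dict means no EDAC memory controllers were found, for
    example because no EDAC driver is loaded for this chipset.
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts


def new_errors(before, after):
    """Return per-controller (ce, ue) increases between two snapshots."""
    deltas = {}
    for name, (ce1, ue1) in after.items():
        ce0, ue0 = before.get(name, (0, 0))
        if ce1 > ce0 or ue1 > ue0:
            deltas[name] = (ce1 - ce0, ue1 - ue0)
    return deltas
```

Taking one snapshot before a stress run and one after, then passing both
to new_errors(), shows which memory controller logged new errors during
that run.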
> 
>  
> 
>     Since this system has 128 GB of RAM, I think it's a good assumption that
>     many programs might not use all of that RAM, so I need something memory
>     specific that I know will hit all 128 GB of RAM.
> 
> 
> Breakin uses the same algorithm described at
> http://www.advancedclustering.com/faq/how-do-i-tune-my-hpldat-file.html
> to calculate the "N" size which will consume 90% of the RAM of a system
> using all cores (in as close to a square grid as possible).
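
A rough sketch of that kind of sizing calculation, for illustration only.
It is not taken from Breakin or from the FAQ above. It assumes HPL's
N x N matrix of 8-byte doubles, the 90% figure quoted above, and a block
size NB of 128 (a placeholder value); the function names are hypothetical.

```python
import math


def hpl_problem_size(mem_bytes, fraction=0.90, nb=128, bytes_per_element=8):
    """Largest N, rounded down to a multiple of nb, such that an N x N
    matrix of doubles fits in `fraction` of the given memory."""
    n = int(math.sqrt(mem_bytes * fraction / bytes_per_element))
    return (n // nb) * nb


def process_grid(cores):
    """Return (P, Q) with P * Q == cores and P <= Q, choosing the
    factor pair that is as close to square as possible."""
    p = math.isqrt(cores)
    while cores % p != 0:
        p -= 1
    return p, cores // p
```

For the 32-core node discussed here, process_grid(32) gives a 4 x 8
grid, and hpl_problem_size(128 * 1024**3) gives the N for 128 GB.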
> 
>  
> 
>     So far, mprime appears to be working. I was able to trigger an SBE in 21
>     hours the first time I ran it.  I plan on running it repeatedly for the
>     next few days to see how well it can repeat finding errors.
> 
> 
>  I'm curious what kernel you're running that is giving you EDAC
> reporting. Or are you rebooting after an MCE and examining the system
> event logs?
> 
> 
> -- 
> Jason D. Clinton, Advanced Clustering Technologies
> 913-643-0306

-- 
Prentice