[Beowulf] Memory stress testing tools.

Prentice Bisbal prentice at ias.edu
Fri Dec 10 06:24:30 PST 2010


David,

Thanks for the e-mail. Due to its length, I'm not including it in my 
reply, which I know is normally bad mailing list etiquette.

The server is a Dell PowerEdge R815 with 4 8-Core AMD processors and 128 
GB of RAM.

I installed two identical servers at the same time, named frigga and 
odin (husband and wife in Norse mythology, if you're curious). These nodes 
are not part of a Beowulf cluster, but this is the best forum I know of 
to discuss problems like this.

Odin is the system with errors, and it started reporting SBE errors 
almost immediately, even when the system was completely idle. They 
started within hours of operating system installation, before users were 
even able to log in to the system.

As you pointed out, I don't think SBE errors are fatal, but I like to 
address all system errors I identify, no matter how trivial. I find that 
when you get used to ignoring "harmless" errors, you eventually end up 
ignoring all errors.

So, you are right that I'm looking for a tool to quickly and reliably 
reproduce SBEs so that I can quickly resolve this problem with Dell. For 
reasons I can't discuss here, working with the user is not an option. 
Due to the nature of my institution, users are only here for a couple of 
years, anyway, and I'm looking for a tool that I can use long after this 
user (and his code) are gone.

I have been keeping detailed logs of exactly when the SBE errors occur. 
I have also been reseating and swapping DIMMs to see if the errors move 
with the DIMM or stay with the slot, to determine whether it's a bad 
DIMM or a bad motherboard. On the first occasion, the error did move 
with the DIMM, and I replaced the DIMM. Since then, the errors have been 
moving from DIMM to DIMM, even across banks of DIMMs. Since each bank 
corresponds to a socket, this would indicate that either no single 
on-chip memory controller is at fault, or they're all bad.
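
The swap test described above can be summarized mechanically from the 
error log. What follows is only a minimal sketch, assuming each logged 
SBE has been reduced to a (slot, DIMM serial) pair; the function name, 
the slot labels, and the serial numbers are all hypothetical and not 
taken from the actual logs.

```python
from collections import defaultdict

def classify_sbe_log(events):
    """Classify logged SBEs given (slot, dimm_serial) pairs, one per error.

    Assumption: the slot label and DIMM serial are recorded by hand
    alongside each error; both fields are hypothetical.
    """
    slots_by_dimm = defaultdict(set)  # DIMM serial -> slots it erred in
    dimms_by_slot = defaultdict(set)  # slot -> DIMM serials that erred there
    for slot, serial in events:
        slots_by_dimm[serial].add(slot)
        dimms_by_slot[slot].add(serial)

    if not slots_by_dimm:
        return "no errors logged"
    if len(slots_by_dimm) == 1 and len(dimms_by_slot) > 1:
        # One DIMM, errors in several slots: suspect the DIMM.
        return "follows the DIMM"
    if len(dimms_by_slot) == 1 and len(slots_by_dimm) > 1:
        # One slot, errors with several DIMMs: suspect the slot or board.
        return "stays with the slot"
    if len(slots_by_dimm) == 1 and len(dimms_by_slot) == 1:
        return "inconclusive: swap the DIMM and retest"
    # Several DIMMs and several slots: no single DIMM or slot explains it.
    return "spread across DIMMs and slots"
```

The last case is the one that matches what I am seeing now: errors on 
several DIMMs in several slots, which no single bad DIMM or slot explains.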

My goal is to find a tool that reproduces SBE errors in a finite time 
frame, and then run it repeatedly and collect data on where these SBEs 
occur. I suspect it's a bad motherboard, but 
unless I have overwhelming data showing that, Dell will just keep 
replacing the DIMMs, and I'm pretty confident it's not bad DIMMs in this 
case.
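
One way to collect that data around each stress run is to snapshot the 
corrected-error counters before and after. This is a rough sketch only, 
assuming a Linux system with an EDAC driver loaded that exposes 
per-csrow ce_count files under /sys/devices/system/edac/mc; whether this 
particular server reports its SBEs that way is an assumption, and the 
function names are made up for illustration.

```python
from pathlib import Path

# Assumption: the kernel EDAC driver is loaded and uses the csrow layout.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")

def read_ce_counts(root=EDAC_ROOT):
    """Return {"mcN/csrowM": corrected_error_count} for every csrow found."""
    counts = {}
    for f in Path(root).glob("mc*/csrow*/ce_count"):
        # f.parent is the csrow directory, f.parent.parent the controller.
        key = f"{f.parent.parent.name}/{f.parent.name}"
        counts[key] = int(f.read_text().strip())
    return counts

def new_errors(before, after):
    """Return only the locations whose count grew between two snapshots."""
    return {
        key: after[key] - before.get(key, 0)
        for key in after
        if after[key] > before.get(key, 0)
    }
```

The idea would be to call read_ce_counts() before starting the stress 
tool and again when it finishes, then log new_errors(before, after) with 
a timestamp for each run.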

As stated earlier, HPL wasn't reliable for me in this capacity. I'm now 
using mprime's stress test mode, and will also test stressapptest.


-- 
Prentice



