[Beowulf] Memory stress testing tools

David Mathog mathog at caltech.edu
Fri Dec 10 13:18:02 PST 2010


Prentice Bisbal wrote:
 
> The server is a Dell PowerEdge R815 with 4 8-Core AMD processors and 128 
> GB of RAM.

If the failing addresses are moving around in memory with no
correlation to particular DIMMs, then the next most likely culprits are
a marginal power supply, CPU, or motherboard, in pretty much that order.
(OK, CPU vs. motherboard is kind of a toss-up, but since you have 32
cores in the system I put the CPU first.)
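
One way to check whether the errors really are uncorrelated with the
DIMMs is to tally them per DIMM and see whether any single module
dominates. The sketch below assumes you already have each error decoded
to a DIMM label by whatever tool reported it; the labels and addresses
here are invented for illustration and the function names are mine.

```python
from collections import Counter

def errors_by_dimm(records):
    """Count errors per DIMM label from (dimm_label, address) pairs."""
    return Counter(label for label, _addr in records)

def top_share(counts):
    """Fraction of all errors that landed on the single worst DIMM."""
    total = sum(counts.values())
    return max(counts.values()) / total if total else 0.0

# Hypothetical records, invented purely for illustration.
records = [
    ("DIMM_A1", 0x1F400000), ("DIMM_B3", 0x7A2C1000),
    ("DIMM_C2", 0x3B009000), ("DIMM_A4", 0x12FF0000),
]
counts = errors_by_dimm(records)
print(counts.most_common())
print(f"worst DIMM holds {top_share(counts):.0%} of errors")
```

A share near 100% points at one module; errors spread evenly over many
modules fit the "not the DIMMs" picture better.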

If you have access to an oscilloscope, look closely at the voltages on
the two machines.  No need to cut in anywhere, just measure +5V and +12V
on an unused disk or fan connector.  If the machine prone to memory
errors is significantly noisier than the one that is not, that could be
the problem.  I have seen this exactly once: every power supply tester
said the supply was good, and a multimeter showed it pegged at the right
voltages, but there was a ton of high frequency noise coming out of the
power supply.
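
If the scope can export samples, the comparison between the two
machines can be made numeric. This is only a sketch: the sample values
are invented, and the function name is mine. It shows why a multimeter
can read fine (the mean is right) while the rail is still noisy (the
peak-to-peak and RMS ripple are not).

```python
import math

def ripple_stats(samples):
    """Return (mean, peak-to-peak, RMS of the AC component) in volts."""
    mean = sum(samples) / len(samples)
    ac = [v - mean for v in samples]          # remove the DC level
    peak_to_peak = max(samples) - min(samples)
    rms = math.sqrt(sum(x * x for x in ac) / len(ac))
    return mean, peak_to_peak, rms

# Invented +12V samples for a quiet and a noisy supply.
quiet = [12.00, 12.01, 11.99, 12.00]
noisy = [12.00, 12.30, 11.70, 12.00]

for name, trace in (("quiet", quiet), ("noisy", noisy)):
    mean, p2p, rms = ripple_stats(trace)
    print(f"{name}: mean={mean:.3f} V  p-p={p2p:.3f} V  rms={rms:.4f} V")
```

Both traces average to about 12 V, which is all a multimeter would
report; only the ripple figures separate them.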

If you can disable CPUs through the BIOS on that machine, running for a
while with each CPU enabled alone might narrow the issue down to one of
the four.  You wouldn't be done then, though, because it could be the
socket and not the CPU itself.  Still, if you can get it down to one
CPU, you could swap it with another and see whether the issue moves
with it.
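
If the BIOS will not let you disable CPUs, a rough software stand-in is
to pin a test to one core at a time. The sketch below assumes Linux
(os.sched_setaffinity is not available elsewhere) and uses a toy
pattern check of my own, not a real memory tester; in practice you
would pin your actual stress tool the same way. Pinning is also not
equivalent to disabling a CPU in the BIOS, since the other sockets and
their memory controllers stay active.

```python
import os

def pattern_mismatches(n_bytes, pattern=0xA5):
    """Fill a buffer with one byte pattern, read it back, count bad bytes."""
    buf = bytearray([pattern]) * n_bytes
    return n_bytes - buf.count(pattern)

def run_pinned(cpus, n_bytes):
    """Run the pattern check with this process restricted to `cpus`."""
    original = os.sched_getaffinity(0)    # Linux-only call
    os.sched_setaffinity(0, cpus)
    try:
        return pattern_mismatches(n_bytes)
    finally:
        os.sched_setaffinity(0, original)  # always restore the old mask

if __name__ == "__main__":
    # Test each logical CPU this process is allowed to run on, in turn.
    for cpu in sorted(os.sched_getaffinity(0)):
        bad = run_pinned({cpu}, 16 * 1024 * 1024)
        print(f"cpu {cpu}: {bad} mismatched bytes")
```

On Linux the socket each logical CPU belongs to can be read from
/sys/devices/system/cpu/cpuN/topology/physical_package_id, which lets
you group the per-core results by physical CPU.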

You probably already did this, but be sure both machines have the same
BIOS release.
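
On Linux the BIOS details are exposed under /sys/class/dmi/id, so the
check can be scripted. A minimal sketch, assuming that directory is
present and readable on both machines; the function names are mine, and
the values in the usage example are placeholders, not real release
strings.

```python
from pathlib import Path

def read_bios_info(dmi_dir="/sys/class/dmi/id"):
    """Read BIOS vendor, version and date; None for any missing file."""
    base = Path(dmi_dir)
    info = {}
    for name in ("bios_vendor", "bios_version", "bios_date"):
        path = base / name
        info[name] = path.read_text().strip() if path.exists() else None
    return info

def bios_differences(a, b):
    """Return {field: (value_a, value_b)} for every field that differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

if __name__ == "__main__":
    # Run on each machine and compare the printed output by hand,
    # or collect both dicts and pass them to bios_differences().
    print(read_bios_info())
```

For example, bios_differences({"bios_version": "X"}, {"bios_version":
"Y"}) returns {"bios_version": ("X", "Y")}, and an empty dict means the
two machines report the same BIOS.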


Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


