Problems with dual Athlons

Robert G. Brown rgb at phy.duke.edu
Wed Jul 31 11:34:54 PDT 2002


On Wed, 31 Jul 2002, Steve Gaudet wrote:

> A few things I'd look at: memory and cooling.  The MP Athlons, I feel,
> must have copper core heat sinks with a well-matched fan.  I noticed
> you didn't mention the case.  If it's a rackmount, make sure there is
> adequate space between the case cover and the fan.  If not, this could
> be the problem.

We monitor case and CPU temperatures.  Ambient air going in is about
58F, coming out about 70F.  I don't think this is an issue any more.
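
For anyone who thinks in Celsius, here is a minimal sketch of the
conversion for the two figures above.  The function names are just
illustrative; they are not from any monitoring tool.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert an absolute temperature from Fahrenheit to Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def rise_in_celsius(inlet_f, outlet_f):
    """Temperature rise across the case, in Celsius degrees.

    A temperature *difference* only scales by 5/9; the 32-degree
    offset cancels out.
    """
    return (outlet_f - inlet_f) * 5.0 / 9.0


# Figures quoted above: air going in and air coming out, in Fahrenheit.
inlet_f, outlet_f = 58.0, 70.0

print(f"inlet:  {fahrenheit_to_celsius(inlet_f):.1f} C")
print(f"outlet: {fahrenheit_to_celsius(outlet_f):.1f} C")
print(f"rise:   {rise_in_celsius(inlet_f, outlet_f):.1f} C")
```

That works out to roughly 14.4 C in, 21.1 C out, a rise of about 6.7 C.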

> 
> Look at the memory and verify all the chips are the same.  Some memory
> chip sets don't play well together.

The memory is all AMD/Tyan approved, straight off their list, and we've
swapped around DIMMs from several approved brands.  Of course this
proves nothing -- too small an N, for all that.  About the only thing
one can conclude is that there is SOMETHING marginal about the
engineering of this system -- I've never run a system more temperamental
about environment and configuration, in nearly 20 years of PC and
workstation operations.  Identical configurations -- case, motherboard,
CPU, power supply, memory, network, BIOS version and setup.  One works,
the other just doesn't.  One works when plugged in HERE, but not when
plugged in THERE.  Almost certainly hardware, but something very subtle
-- a timing issue, some sort of noise bleeding through the power supply.
Whatever it is, we've done a lot of hardware exercise testing and gotten
nothing, for the most part.  It isn't even clear that the crashes are
load dependent.  Sometimes an idle system just decides to die.

As I said, if I only had boundless time I'd stick a scope on the supply
wires and see if there is anything interesting out there in the high
frequency (MHz and up) range that might be triggering a crash.
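
To make concrete what one would look for in such a trace, here is a
rough sketch that picks the strongest component above 1 MHz out of a
sampled supply voltage.  Everything in it is an assumption made for
illustration: the 100 MHz sample rate, the 12 V rail, and the 50 mV
ripple at 6.25 MHz are synthetic, not measurements from these boards,
and the function names are hypothetical.  It uses a naive DFT from the
standard library; a real analysis would use captured scope data and a
proper FFT package.

```python
import cmath
import math


def amplitude_spectrum(samples, sample_rate_hz):
    """Return (frequency_hz, amplitude) pairs for DFT bins 1 .. N/2 - 1.

    The DC bin (k = 0) is skipped so the nominal rail voltage does not
    swamp the ripple.  This is an O(N^2) DFT, fine for a few hundred points.
    """
    n = len(samples)
    spectrum = []
    for k in range(1, n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(samples))
        # 2|X[k]|/N recovers the amplitude of a real sinusoid in bin k.
        spectrum.append((k * sample_rate_hz / n, 2.0 * abs(acc) / n))
    return spectrum


def strongest_component(samples, sample_rate_hz, min_freq_hz=1.0e6):
    """Return the (frequency_hz, amplitude) of the largest bin >= min_freq_hz."""
    candidates = [(f, a)
                  for f, a in amplitude_spectrum(samples, sample_rate_hz)
                  if f >= min_freq_hz]
    return max(candidates, key=lambda pair: pair[1])


# Synthetic trace (assumed values): 12 V DC plus 50 mV of ripple.
SAMPLE_RATE_HZ = 100.0e6
N = 256
RIPPLE_HZ = 16 * SAMPLE_RATE_HZ / N   # 6.25 MHz, chosen to sit on a bin
trace = [12.0 + 0.05 * math.sin(2 * math.pi * RIPPLE_HZ * i / SAMPLE_RATE_HZ)
         for i in range(N)]

freq_hz, amplitude_v = strongest_component(trace, SAMPLE_RATE_HZ)
print(f"strongest component: {freq_hz / 1e6:.2f} MHz, {amplitude_v * 1e3:.1f} mV")
```

On real data the ripple will not sit neatly on a bin, so one would want
a window function and a much longer capture before trusting any peak.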

> Might want to try memtest86, can be found at http://www.memtest86.com/
> 
> Another one is http://sourceforge.net/projects/va-ctcs/
> 
> We use ctcs for 72 hour burn in and it works at finding hardware problems.
> 
> Once you verify that the hardware is in fact solid, I'd just reload the
> software from scratch and start over.  In the long run, it's sometimes
> quicker.
> 
> Hope this helps.

Oh, it always helps.  Actually, I'm doing decently with my 2466 cluster
(much better system than the 2460 in terms of stability) although it
still has a bit more "character" than I'd like (he grumbles as he goes
in to reboot a few cranky nodes that are having PXE issues).  The bad
2460's we've reinstalled, rebuilt, and now are RMA'ing the motherboards
to get 2466's in exchange as they crash the sixth or seventh or eleventh
time.  It may be something as simple as quality control or design issues
on the motherboard -- a few traces with inductances that can resonate
enough to create a spurious signal in the logic, depending on things like
precisely where nearby metal sits and just how much HF noise there is on
the power, or a couple of traces that run too far and too close together,
so that a signal on one can be picked up on the other.  A different
layout and the problems go away.

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu






More information about the Beowulf mailing list