Dual-Athlon Cluster Problems

Thu Jan 23 09:20:02 PST 2003

On Thu, 23 Jan 2003, Chris Steward wrote:

> Hi,
> 
> We're in the process of setting up a new 32-node dual-athlon cluster running
> Redhat 7.3, kernel 2.4.18-19.7.smp. The configuration is attached below. We're
> having problems with nodes hanging during calculations, sometimes only after
> several hours of runtime. We have a serial console connected to such nodes but
> that is unable to interact with the nodes once they hang. Nothing is logged
> either. Running jobs on one CPU doesn't seem to present too much
> of a problem, but when the machines are fully loaded (both CPUs at 100%
> utilization) errors start to occur and machines die – often up to 8 nodes
> within 24 hours. Temperature of nodes under full load is approximately 55C.
> We have tried using the "noapic" option but the problems still persist.  Using
> other software not requiring EnFuzion 6 also produces the same problems.

I'd suspect the following (in order):

  Heat.  Make them colder.  They'll like it.  This may involve upgrading
internal fans and setting the temperature down in the room (55C
notwithstanding).  Athlons just hate heat, and a failure under load
always makes one wonder...
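
One way to check the heat hypothesis is to log temperature and load on
each node so the last reading before a hang survives on disk.  What
follows is only a minimal sketch under stated assumptions: it reads the
sysfs hwmon files (temp*_input, in millidegrees C) that newer Linux
kernels provide; a stock 2.4 kernel has no sysfs, so there the readings
would have to come from lm_sensors instead.  The function names
read_temps and log_once and the log path are hypothetical.

```python
import glob
import os
import time

# Assumption: sysfs hwmon layout (not present on stock 2.4 kernels).
HWMON_PATTERN = "/sys/class/hwmon/hwmon*/temp*_input"


def read_temps(pattern=HWMON_PATTERN):
    """Return {path: degrees C} for every readable sensor file matching pattern."""
    temps = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                temps[path] = int(f.read().strip()) / 1000.0  # millidegrees -> C
        except (OSError, ValueError):
            continue  # sensor unreadable or not numeric; skip it
    return temps


def log_once(logfile, pattern=HWMON_PATTERN):
    """Append one timestamped line and force it to disk before returning."""
    try:
        load1 = os.getloadavg()[0]  # 1-minute load average
    except OSError:
        load1 = float("nan")
    fields = " ".join("%.1f" % t for t in read_temps(pattern).values())
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    line = "%s load=%.2f temps=%s\n" % (stamp, load1, fields)
    with open(logfile, "a") as f:
        f.write(line)
        f.flush()
        os.fsync(f.fileno())  # so the last line survives a hard hang
    return line
```

Calling log_once("/var/tmp/node-temps.log") every 30 seconds or so on
each node (from cron or a small loop) leaves a record of the temperature
and load just before a node went away, which can then be compared
between nodes that hang and nodes that don't.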

  Electricity.  See list archives for our nightmarish experience with a
carelessly wired machine room.  In particular, multiphase node supply
lines should NOT share a common neutral.  Try spreading the load out
some.

  BIOS.  There are good BIOS and bad BIOS (you don't say what
motherboard, but I think this is true pretty much across the board).
Sometimes just reflashing the BIOS to the latest update fixes
everything.  Sometimes flashing back FROM the latest update fixes
everything.  Go figure.

  Weirder problems associated with specific node configuration.  We had
problems with particular risers.  We had problems with specific video
cards.  These are the problems that are sometimes fixed with the noapic
option.  Sometimes.  Try cycling your hardware about a bit, if you can,
to see if some particular card is a problem.

  AMD certifies power supplies for these motherboards. They mean it.
If a vendor tells you that "any old" power supply will work, don't
believe it.  It might or might not "work", but will it work stably?
The dual Athlons are a bit touchy about power (see above).

  AMD and the motherboard vendor certify memory for these
motherboards.  Again, they mean it.  Typically certified ECC DDR,
high-end memory.

> We seek feedback on the following:
> 
> 1/ Are there issues using redhat 7.3 as opposed to 7.2 in such
>    a setup ?

Not in my experience.

> 2/ Are there known issues with 2.4.18 kernels and AMD chips ?
>    We suspect the problems are kernel related.

Again, not in my experience.  We never had any of our problems
magically go away when changing 2.4.x kernels (in spite of our hopes).

> 3/ Are there any problems with dual-athlon clusters using the
>    MSI K7D Master L motherboard ?

I don't know (we opted for the Tyans) but I wouldn't be surprised.  Some
of the problems may derive from the AMD 76x chipset itself; others may
come from

> 4/ Are there any other outstanding issues with these machines 
>    under constant heavy load ?

See above.  We've had very similar problems with dual Athlons (Tyan 2466
and especially 2460 motherboards).  We have finally more or less
stabilized our cluster, but it wasn't easy and we still have a
smattering of "random" crashes.

Fortunately, they are very productive when they are finally running
well.

We have also observed a weak dependence on the particular jobs being run
on the nodes.  Some people run jobs that seem to initiate a crash while
others do not.  It's not obvious why -- the CPU and so forth can be 100%
loaded either way.  Possibly related to the compiler used (Fortran code
bad, gcc good?), possibly related to memory used or something "done" in
the code -- too difficult to run down.
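
If one did want to run this down, simple bookkeeping is a start: record
each job run with a few attributes and whether the node crashed, then
compare crash rates per attribute value.  The sketch below assumes each
run is a dict with a boolean "crashed" field plus whatever attributes
you choose to record (keys such as "compiler" or "node" are hypothetical
examples, not anything we actually logged).

```python
from collections import Counter


def crash_rates(runs, key):
    """Group runs by runs[i][key]; return {value: (crashes, total, rate)}."""
    total = Counter()
    crashed = Counter()
    for run in runs:
        total[run[key]] += 1
        if run["crashed"]:
            crashed[run[key]] += 1
    # rate is the fraction of runs with this attribute value that crashed
    return {v: (crashed[v], n, crashed[v] / n) for v, n in total.items()}
```

Calling crash_rates(runs, "compiler") and crash_rates(runs, "node") on
the same records would show whether crashes follow the code or the
hardware, provided there are enough runs for the rates to mean anything.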

  HTH,

    rgb

> 
> Any advice/help would be greatly appreciated.
> 
> Thanks in advance
> 
> Chris
> 
> --------------------------------------------------------------
> Cluster configuration
> 
> node configuration:
> 
> CPUs:                    Athlon MP2000+
> RAM:                     1024MB Kingston PC2100 DDR
> Operating system:        Redhat 7.3 (with updates)
> Kernel:                  2.4.18-19.7.xsmp
> Motherboard:             MSI K7D Master L motherboard (Award Bios 1.5).
> Network:                 On-board PCI (Ethernet controller: Intel Corp.
>                          82559ER (rev 09)). (Using latest Intel drivers, "no sleep" option set)
> 
> head-node:
> 
> CPU:                     single Athlon MP2000+
> 
> Dataserver:
> 
> CPU:                     single Athlon MP2000+
> Network:                 PCI Gigabit NIC
> 
> Network Interconnect:
> 
> cisco 2950 (one GBIC installed)
> 
> Software:
> 
> Cluster management:      EnFuzion 6
> Computational:           Dock V4.0.1
> 
> 
> 
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> 

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu