Dual-Athlon Cluster Problems

Chris Steward chris at wehi.edu.au
Wed Jan 22 22:45:40 PST 2003


Hi,

We're in the process of setting up a new 32-node dual-Athlon cluster running
Red Hat 7.3, kernel 2.4.18-19.7.xsmp. The configuration is attached below. We're
having problems with nodes hanging during calculations, sometimes only after
several hours of runtime. We have a serial console connected to such nodes, but
it cannot interact with a node once it hangs, and nothing is logged either.
Running jobs on one CPU doesn't seem to present much of a problem, but when the
machines are fully loaded (both CPUs at 100% utilization) errors start to occur
and machines die, often up to 8 nodes within 24 hours. Temperature of the nodes
under full load is approximately 55C. We have tried the "noapic" boot option but
the problems persist. Other software not requiring EnFuzion 6 produces the same
problems.
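Since the console shows nothing once a node wedges, one thing worth trying is
logging node state to disk with fsync() so the last samples survive a power
cycle. Below is a minimal sketch (assuming lm_sensors is installed and its
`sensors` command can read this board; the log path, interval, and script name
are all placeholders):

#!/usr/bin/env python
# log_state.py - minimal sketch: append load average and lm_sensors
# output to disk every few seconds, fsync()ing so the last records
# survive a hard hang. Log path and interval are placeholders.
import os
import time

LOG = "/var/log/node-state.log"   # placeholder path
INTERVAL = 5                      # seconds between samples

def read_loadavg():
    f = open("/proc/loadavg")
    line = f.read().strip()
    f.close()
    return line

def read_sensors():
    # `sensors` output varies by monitoring chip; log it verbatim
    pipe = os.popen("sensors 2>/dev/null")
    out = pipe.read()
    pipe.close()
    return out

log = open(LOG, "a")
while 1:
    log.write("=== %s\n" % time.ctime())
    log.write("loadavg: %s\n" % read_loadavg())
    log.write(read_sensors())
    log.flush()
    os.fsync(log.fileno())        # force the record onto disk now
    time.sleep(INTERVAL)

After a hung node is power-cycled, the tail of that log gives the last readings
before the hang; if the temperatures there look normal, that points away from
cooling.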

We seek feedback on the following:

1/ Are there issues using Red Hat 7.3 as opposed to 7.2 in such
   a setup?

2/ Are there known issues with 2.4.18 kernels and AMD chips?
   We suspect the problems are kernel-related.

3/ Are there any known problems with dual-Athlon clusters using the
   MSI K7D Master L motherboard?

4/ Are there any other outstanding issues with these machines
   under constant heavy load? (See the load sketch after this list.)
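Regarding 4/, to take EnFuzion and DOCK out of the picture when reproducing
the full-load condition, something as simple as the sketch below holds both
CPUs at 100% (the worker count and busy-work are arbitrary; the script name is
a placeholder):

#!/usr/bin/env python
# burn.py - minimal sketch: fork one busy worker per CPU so the node
# sits at 100% utilization without EnFuzion or DOCK involved.
import os

NCPUS = 2  # dual-CPU nodes

for i in range(NCPUS):
    if os.fork() == 0:
        # child: spin on floating-point busy-work forever
        x = 1.000001
        while 1:
            x = (x * x + 1.0) % 999983.0
# parent: park until the children are killed
for i in range(NCPUS):
    os.wait()

If nodes still die under this load, the application stack is cleared and the
fault sits at the kernel, BIOS, or hardware level.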

Any advice/help would be greatly appreciated.

Thanks in advance

Chris

--------------------------------------------------------------
Cluster configuration

Node configuration:

CPUs:                    2 x Athlon MP 2000+
RAM:                     1024MB Kingston PC2100 DDR
Operating system:        Red Hat 7.3 (with updates)
Kernel:                  2.4.18-19.7.xsmp
Motherboard:             MSI K7D Master L (Award BIOS 1.5)
Network:                 On-board PCI Ethernet: Intel Corp. 82559ER (rev 09),
                         using the latest Intel drivers with the "no sleep"
                         option set

Head node:

CPU:                     1 x Athlon MP 2000+

Data server:

CPU:                     1 x Athlon MP 2000+
Network:                 PCI Gigabit NIC

Network Interconnect:

Cisco 2950 (one GBIC installed)

Software:

Cluster management:      EnFuzion 6
Computational:           DOCK v4.0.1






