Dual-Athlon Cluster Problems
chris at wehi.edu.au
Wed Jan 22 22:45:40 PST 2003
We're in the process of setting up a new 32-node dual-Athlon cluster running
Redhat 7.3, kernel 2.4.18-19.7.smp. The configuration is listed below. We're
having problems with nodes hanging during calculations, sometimes only after
several hours of runtime. We have a serial console connected to such nodes, but
it cannot interact with the nodes once they hang. Nothing is logged
either. Running jobs on one CPU doesn't seem to present too much
of a problem, but when the machines are fully loaded (both CPUs at 100%
utilization) errors start to occur and machines die, often up to 8 nodes
within 24 hours. Temperature of nodes under full load is approximately 55C.
We have tried using the "noapic" option but the problems persist. Using
other software that does not require EnFuzion 6 produces the same problems.
We seek feedback on the following:
1/ Are there issues using Redhat 7.3 as opposed to 7.2 in such a setup?
2/ Are there known issues with 2.4.18 kernels and AMD chips? We suspect the
   problems are kernel related.
3/ Are there any problems with dual-Athlon clusters using the MSI K7D Master L
   motherboard?
4/ Are there any other outstanding issues with these machines under constant
   heavy load?
Any advice/help would be greatly appreciated.
Thanks in advance
CPUs: Athlon MP2000+
RAM: 1024 MB Kingston PC2100 DDR
Operating system: Redhat 7.3 (with updates)
Motherboard: MSI K7D Master L motherboard (Award BIOS 1.5).
Network: On-board PCI (Ethernet controller: Intel Corp.
82559ER (rev 09)). (Using latest Intel drivers, "no sleep" option set)
CPU: single Athlon MP2000+
Network: PCI Gigabit NIC
Switch: Cisco 2950 (one GBIC installed)
Cluster management: EnFuzion 6
Computational: Dock V4.0.1