Dual-Athlon Cluster Problems
Ben Ransom
bransom at ucdavis.edu
Sun Jan 26 22:13:09 PST 2003
We have had a 20-node dual-Athlon cluster up (sort of) since June 2002, and
have experienced more problems than I anticipated. I was a newbie at that
time. We didn't get around to trying multi-day runs until 5-6 months after
the cluster was new, at which point problems began to appear. At first we
wondered whether our problems were caused by the long runs or were just a
coincidence.
I agree, it can be very difficult to pin down the source of troubles. Our
first discovery was that we had a batch of bad CPU cooling fans. These
were AMD fans, and we were told that AMD had switched to a different fan or
bearing supplier sometime in early 2002. Anyway, they shipped us 40 new fans
and we (ugh) replaced them all. When opening up all the nodes to replace
fans, I would estimate that over half were showing some degree of trouble
(vibration, speed, etc.). After this episode, we worked up to confidence in
cooling and power supply by running code with Ethernet MPI on all nodes
(different 100% CPU runs on groups of 4 nodes) for at least 48 hours.
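For anyone wanting to do a similar burn-in, here is a minimal sketch of that
kind of test (hypothetical code, not our actual production code; it assumes
mpicc and a standard MPI implementation of that era such as MPICH or LAM).
It simply keeps each CPU pegged with floating-point work and does a periodic
collective over Ethernet MPI, so a flaky node shows up here rather than in a
real multi-day run.

    /* burnin.c -- hypothetical sketch of a 100% CPU burn-in with light MPI traffic.
     * Build:  mpicc -O2 -o burnin burnin.c
     * Run:    mpirun -np 8 ./burnin   (e.g. two processes per dual-CPU node)
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        long i, iter = 0;
        double x = 0.0, sum;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (;;) {                      /* run until killed (48+ hours) */
            /* busy floating-point loop keeps this CPU at 100% load */
            for (i = 0; i < 50000000L; i++)
                x += 1.0 / (double)(i + 1);

            /* periodic collective: if a node has hung, everyone stalls here
             * and the failure is obvious instead of corrupting a real run */
            MPI_Allreduce(&x, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

            if (rank == 0 && ++iter % 10 == 0)
                printf("iteration %ld, checksum %g\n", iter, sum);
        }

        /* never reached; MPI_Finalize() omitted since the loop runs forever */
        return 0;
    }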
As all this was settling out, we were still having trouble getting some of
our code to run on more than 4 nodes with Dolphin SCI MPI. We had a bad
Dolphin card, and were delayed in getting that figured out and replaced
over Xmas break. Unfortunately, that hasn't been the end of it. We are
still unable to run our primary code over the Dolphin interconnect, yet we
are somewhat confident in the rest of the cluster from the successful
Ethernet runs.
One of the reasons we chose Dolphin over Myrinet was that we thought we'd
avoid the single point of failure in a Myrinet switch. This was bad
judgement, as we now know that a bad card (or cable) in our Dolphin setup
not only crashes a random node's kernel during a run with heavy message
passing, but also seemingly prevents us from even launching that code on
other, isolated Dolphin rings, i.e. rings that don't include a node with
the suspect SCI card. Isolating which card or cable is bad requires time
and experience ...which I am gaining ;/ .
It is still curious to me that we can run other codes on Dolphin SCI,
showing 100% CPU utilization (full power/heat), on a ring away from the
suspect SCI card. This implies that reliability is code dependent, as
others have alluded. I suppose this may be due to the amount of message
passing? Hopefully the problem will disappear once we get our full Dolphin
set in working order.
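If message-passing volume really is the trigger, a simple ping-pong loop
like the sketch below (again hypothetical, same mpicc/MPI assumptions as
above) should separate the two cases: it hammers the interconnect with
back-to-back sends and receives, so a marginal card ought to show up much
faster here than under a mostly compute-bound code that only communicates
occasionally.

    /* pingpong.c -- hypothetical sketch: saturate the interconnect with messages.
     * Build:  mpicc -O2 -o pingpong pingpong.c
     * Run:    mpirun -np 2 ./pingpong   (one process on each of two nodes)
     */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    #define MSG_BYTES 65536          /* 64 KB payload per message       */
    #define ROUNDS    100000         /* round trips per progress report */

    int main(int argc, char **argv)
    {
        int rank, peer, r;
        long total = 0;
        char buf[MSG_BYTES];
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(buf, rank, sizeof(buf));
        peer = 1 - rank;             /* assumes exactly two ranks */

        for (;;) {
            for (r = 0; r < ROUNDS; r++) {
                if (rank == 0) {     /* rank 0 sends first, then waits */
                    MPI_Send(buf, MSG_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, MSG_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &st);
                } else {             /* rank 1 waits, then replies */
                    MPI_Recv(buf, MSG_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &st);
                    MPI_Send(buf, MSG_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
                }
            }
            total += ROUNDS;
            if (rank == 0)
                printf("%ld round trips completed\n", total);
        }
    }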
BTW, we are using dual Athlon MP 1800+ CPUs on Tyan S2466 motherboards (AMD
760MPX chipset), Redhat 7.2 with a 2.4.18 SMP kernel, and as suggested
above, I think everything is probably fine with this.
-Ben Ransom
UC Davis, Mech Engineering Dept
At 05:45 PM 1/23/2003 +1100, Chris Steward wrote:
>Hi,
>
>We're in the process of setting up a new 32-node dual-athlon cluster running
>Redhat 7.3, kernel 2.4.18-19.7.smp. The configuration is attached below. We're
>having problems with nodes hanging during calculations, sometimes only after
>several hours of runtime. We have a serial console connected to such nodes but
>that is unable to interact with the nodes once they hang. Nothing is logged
>either. Running jobs on one CPU doesn't seem to present too much of a
>problem, but when the machines are fully loaded (both CPUs at 100%
>utilization) errors start to occur and machines die, often up to 8 nodes
>within 24 hours. Temperature of the nodes under full load is approximately 55C.
>We have tried using the "noapic" option but the problems still persist. Using
>other software not requiring enfuzion 6 also produces the same problems.
>
>We seek feedback on the following:
>
>1/ Are there issues using redhat 7.3 as opposed to 7.2 in such
> a setup ?
>
>2/ Are there known issues with 2.4.18 kernels and AMD chips ?
> We suspect the problems are kernel related.
>
>3/ Are there any problems with dual-athlon clusters using the
> MSI K7D Master L motherboard ?
>
>4/ Are there any other outstanding issues with these machines
> under constant heavy load ?
>
>Any advice/help would be greatly appreciated.
>
>Thanks in advance
>
>Chris
>
>--------------------------------------------------------------
>Cluster configuration
>
>node configuration:
>
>CPUs: Athlon MP2000+
>RAM: 1024MB Kingston PC2100 DDR
>Operating system: Redhat 7.3 (with updates)
>Kernel: 2.4.18-19.7.xsmp
>Motherboard: MSI K7D Master L motherboard (Award BIOS 1.5).
>Network: On-board PCI (Ethernet controller: Intel Corp.
>82559ER (rev 09)). (Using latest Intel drivers, "no sleep" option set)
>
>head-node:
>
>CPU single Athlon MP2000+
>
>Dataserver:
>
>CPU: single Athlon MP2000+
>Network: PCI Gigabit NIC
>
>Network Interconnect:
>
>cisco 2950 (one GBIC installed)
>
>Software:
>
>Cluster management: EnFuzion 6
>Computational: Dock V4.0.1
>