Dual-Athlon Cluster Problems

Thu Jan 23 10:52:05 PST 2003

On Thu, Jan 23, 2003 at 05:45:40PM +1100, Chris Steward wrote:
> 
> We're in the process of setting up a new 32-node dual-athlon cluster running
> Redhat 7.3, kernel 2.4.18-19.7.smp. The configuration is attached below. We're
> having problems with nodes hanging during calculations, sometimes only after
> several hours of runtime. We have a serial console connected to such nodes but
> that is unable to interact with the nodes once they hang. Nothing is logged
> either. Running jobs on one CPU doesn't seem to present too much
> of a problem, but when the machines are fully loaded (both CPUs at 100%
> utilization) errors start to occur and machines die, often up to 8 nodes
> within 24 hours. Temperature of nodes under full load is approximately 55C.
> We have tried using the "noapic" option but the problems still persist.  Using
> other software not requiring enfuzion 6 also produces the same problems.

I run a cluster of 96 dual AMD nodes (Tyan 2462 mb).

> We seek feedback on the following:
> 
> 1/ Are there issues using redhat 7.3 as opposed to 7.2 in such
>    a setup ?
No.

> 2/ Are there known issues with 2.4.18 kernels and AMD chips ?
>    We suspect the problems are kernel related.
ACPI does not work and you may need a newer version of i2c/lm_sensors.
Neither issue can account for the problems you are seeing. (I haven't
checked RedHat's 2.4.18-19.7.smp kernel, but older versions were
compiled with ACPI disabled, for good reasons as far as I can see.)

> 3/ Are there any problems with dual-athlon clusters using the
>    MSI K7D Master L motherboard ?
Don't know.

> 4/ Are there any other outstanding issues with these machines 
>    under constant heavy load ?
In 99% of all the crashes I have seen on my cluster (and I have seen
a lot) the cause was bad memory. If you did not buy memory certified by
the company that sold you the motherboard, exchange it and your problems
will go away.
[BTW: the "temperature" (defined as the highest of the three temperatures
displayed by lm_sensors) on the nodes ranges between 38C at the bottom of
the racks to 60C at the top. The crashes that I have seen on my cluster
were not correlated with the location of a node within a rack, i.e., they
did not seem to have anything to do with temperature.]
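
For illustration, the "temperature" as defined above (the highest of the
readings reported for a node) could be computed with a small Python sketch
like the one below. The line format it parses ("label: +38.0 C") and the
function name node_temperature are assumptions made for this example only;
real lm_sensors output depends on the chip and driver, so the pattern would
likely need adjusting.

```python
import re

# Assumed line format: "<label>: +38.0 C" or "<label>: +38.0°C".
# Only the first value after the colon is read, so trailing limit
# annotations on the same line are ignored.
TEMP_LINE = re.compile(r"^\s*([^:]+):\s*\+?(-?\d+(?:\.\d+)?)\s*°?\s*C\b")


def node_temperature(sensors_text):
    """Return the highest temperature found in the text, or None if none."""
    temps = []
    for line in sensors_text.splitlines():
        match = TEMP_LINE.match(line)
        if match:
            temps.append(float(match.group(2)))
    return max(temps) if temps else None


if __name__ == "__main__":
    # Hypothetical readings, not taken from a real node.
    sample = "temp1: +38.0 C\ntemp2: +55.5 C\ntemp3: +41.0 C\nfan1: 4500 RPM\n"
    print(node_temperature(sample))
```

Lines that are not temperatures (fan speeds, voltages) do not match the
pattern and are skipped, so the result is simply the maximum over the
temperature lines found.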

Martin

========================================================================
Martin Siegert
Academic Computing Services                        phone: (604) 291-4691
Simon Fraser University                            fax:   (604) 291-4242
Burnaby, British Columbia                          email: siegert at sfu.ca
Canada  V5A 1S6
========================================================================