Dual-Athlon Cluster Problems

Martin Siegert siegert at sfu.ca
Thu Jan 23 10:52:05 PST 2003


On Thu, Jan 23, 2003 at 05:45:40PM +1100, Chris Steward wrote:
> 
> We're in the process of setting up a new 32-node dual-athlon cluster running
> Redhat 7.3, kernel 2.4.18-19.7.smp. The configuration is attached below. We're
> having problems with nodes hanging during calculations, sometimes only after
> several hours of runtime. We have a serial console connected to such nodes but
> that is unable to interact with the nodes once they hang. Nothing is logged
> either. Running jobs on one CPU doesn't seem to present too much
> of a problem, but when the machines are fully loaded (both CPUs at 100%
> utilization) errors start to occur and machines die – often up to 8 nodes
> within 24 hours. Temperature of nodes under full load is approximately 55C.
> We have tried using the "noapic" option but the problems still persist.  Using
> other software not requiring enfuzion 6 also produces the same problems.

I run a cluster of 96 dual AMD nodes (Tyan 2462 mb).

> We seek feedback on the following:
> 
> 1/ Are there issues using redhat 7.3 as opposed to 7.2 in such
>    a setup ?
No.

> 2/ Are there known issues with 2.4.18 kernels and AMD chips ?
>    We suspect the problems are kernel related.
ACPI does not work, and you may need a newer version of i2c/lm_sensors.
Neither issue can account for the problems you are seeing (I haven't
checked RedHat's 2.4.18-19.7.smp kernel, but older versions were
compiled with ACPI disabled, as far as I can see for good reasons).

> 3/ Are there any problems with dual-athlon clusters using the
>    MSI K7D Master L motherboard ?
Don't know.

> 4/ Are there any other outstanding issues with these machines 
>    under constant heavy load ?
In 99% of all the crashes I have seen on my cluster (and I have seen
a lot) the reason was bad memory. If you did not buy memory certified by
the company that sold you the motherboard, exchange it and your problems
will go away.
[BTW: the "temperature" (defined as the highest of the three temperatures
displayed by lm_sensors) on the nodes ranges between 38C at the bottom of
the racks to 60C at the top. The crashes that I have seen on my cluster
were not correlated with the location of a node within a rack, i.e., they
did not seem to have anything to do with temperature.]
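
For anyone who wants to compare nodes the same way, here is a rough sketch
of the "highest of the three temperatures" rule. It is only an illustration:
the function name node_temperature and the sample text are made up, and the
layout of the lm_sensors output differs between chips and versions, so the
pattern will most likely need adjusting for real output.

```python
import re

# Matches a labelled reading such as "temp1:  +55.0 C" or "CPU Temp: +55.0°C".
# Assumption: readings look like "<label>: <number> C"; real output may differ.
TEMP_LINE = re.compile(
    r"^\s*([^:\n]+):\s*\+?(-?\d+(?:\.\d+)?)\s*°?\s*C\b",
    re.MULTILINE,
)

def node_temperature(sensors_text):
    """Return the highest temperature reading in the text, or None if none is found."""
    readings = [float(value) for _label, value in TEMP_LINE.findall(sensors_text)]
    return max(readings) if readings else None

# Hypothetical sample, not taken from a real node.
sample = """\
temp1:     +38.0 C  (limit = +60 C)
temp2:     +55.5 C  (limit = +60 C)
temp3:     +47.0 C  (limit = +60 C)
fan1:      4500 RPM
"""

print(node_temperature(sample))
```

Collecting this one number per node, together with the node's position in
the rack and its crash count, is enough to check whether crashes follow
temperature.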

Martin

========================================================================
Martin Siegert
Academic Computing Services                        phone: (604) 291-4691
Simon Fraser University                            fax:   (604) 291-4242
Burnaby, British Columbia                          email: siegert at sfu.ca
Canada  V5A 1S6
========================================================================


