Help on cluster hang problem...

Cris Rhea crhea at mayo.edu
Sat May 26 22:23:42 PDT 2001


I've been using Linux for several years, but am new to Linux cluster computing.

I set up a "proof of concept" cluster with 4 nodes. Each node is a 1.2GHz Athlon
on a MicroStar K7TPro2-A motherboard with 1GB of RAM (RackSaver 1200).

RedHat 7.1 is installed locally on each system. I also installed mpich-1.2.0-10.i386.rpm
on each system and set up rhosts/hosts.equiv to allow rsh between the nodes.

Systems are interconnected with Intel 10/100 Ethernet cards.

One of the research PhDs in my group has a program that has run successfully on
other supercomputer-class systems (Cray and SGI). It is very CPU-intensive, but
does nothing fancy other than using MPI for communication (very little disk I/O,
etc.).

The /home file system is NFS-mounted on each system. I've tried it with the NFS server
on the master node and on another system outside the cluster.

Even though this code runs as a normal user (not root), it will hard-hang the
"master" node in about 10 minutes. "Hard-hang" means nothing on the console, the disk
light on solid, and no response to the reset or power switches; I have to reset it by
pulling the plug.

I've tried the stock 2.4.2-2 kernel that ships with RedHat 7.1, I've tried the 2.4.2
kernel recompiled to specifically target the Athlon CPU, and I've tried
downloading and using the 2.4.4 kernel. All of my attempts produce the same result:
his program crashes the system every time it is run.

I've searched the usual dejanews/altavista sites for Linux/Athlon/hang, but nothing
interesting pops out. I must be missing something simple; the 2.4.X kernels
can't be that unstable.

Does this ring a bell with anyone in the group?

TIA-

-- Cris

---
 Cristopher J. Rhea                     Mayo Foundation
 Research Computing Facility             Pavilion 2-25
 crhea at Mayo.EDU                        Rochester, MN 55905
 Fax: (507) 266-4486                     (507) 284-0587

More information about the Beowulf mailing list