Help on cluster hang problem...
Cris Rhea
crhea at mayo.edu
Sat May 26 22:23:42 PDT 2001
I've been using Linux for several years, but am new to Linux cluster computing.
I set up a "proof of concept" cluster with 4 nodes. Each node is a 1.2GHz Athlon
on a MicroStar K7TPro2-A motherboard with 1GB of RAM (RackSaver 1200).
RedHat 7.1 is loaded locally on each system. I also installed mpich-1.2.0-10.i386.rpm
on each system and set up rhosts/hosts.equiv to allow rsh between the nodes.
Systems are interconnected with Intel 10/100 Ethernet cards.
One of the research PhDs in my group has a program that has run successfully on
other supercomputer-class systems (Cray and SGI). It is very CPU-intensive, but
does nothing fancy other than using MPI for communication (very little disk I/O,
etc.).
The /home file system is NFS-mounted on each system. I've tried using both the master
node and another system outside the cluster as the NFS server.
Even though this code runs as a normal user (not root), it will hard-hang the
"master" node in about 10 minutes. "Hard-hang" means nothing on the console, the disk
light on solid, and no response to the reset or power switches; I have to reset it by
pulling the plug.
I've tried the stock 2.4.2-2 kernel that loads with RedHat 7.1, I've tried the 2.4.2
kernel recompiled to specifically call the CPU an Athlon, and I've tried
downloading and using the 2.4.4 kernel. All of my attempts produce the same result:
his program crashes the system every time it is run.
I've searched the normal dejanews/altavista sites for Linux/Athlon/hang, but nothing
interesting pops out. I must be missing something simple; the 2.4.X kernels
can't be that unstable.
Does this ring a bell with anyone in the group?
TIA-
-- Cris
---
Cristopher J. Rhea Mayo Foundation
Research Computing Facility Pavilion 2-25
crhea at Mayo.EDU Rochester, MN 55905
Fax: (507) 266-4486 (507) 284-0587