Help on cluster hang problem...
Cris Rhea crhea at mayo.edu
Sat May 26 22:23:42 PDT 2001
I've been using Linux for several years, but am new to Linux cluster computing.

I set up a "proof of concept cluster" with 4 nodes. Each node is a 1.2GHz Athlon on a MicroStar K7TPro2-A motherboard with 1GB of RAM (RackSaver 1200). RedHat 7.1 is loaded locally on each system. I also loaded mpich-1.2.0-10.i386.rpm on each system and set up the rhosts/hosts.equiv to allow all the rsh stuff... Systems are interconnected with Intel 10/100 Ethernet cards.

One of the research PhDs in my group has a program that has run successfully on other supercomputer-class systems (Cray and SGI). It is very CPU-intensive, but does nothing fancy other than using MPI for communication (very little disk I/O, etc.).

The /home file system is NFS mounted on each system. I've tried it with the NFS server on the master node and on another system outside the cluster.

Even though this code runs as a normal user (not root), it will hard-hang the "master" node in about 10 minutes. "Hard-hang" means nothing on the console, disk light on solid, and no response to the reset or power switches; I have to reset by pulling the plug.

I've tried the stock 2.4.2-2 kernel that loads with RedHat 7.1, I've tried the 2.4.2 kernel recompiled to specifically call the CPU an Athlon, and I've tried downloading and using the 2.4.4 kernel. All of my attempts produce the same result: his program crashes the system every time it is run.

I've searched the normal dejanews/altavista sites for Linux/Athlon/hang, but nothing interesting pops out.

I must be missing something simple; the 2.4.X kernels can't be that unstable.

Does this ring a bell with anyone in the group?

TIA-

-- Cris

---
Cristopher J. Rhea
Mayo Foundation, Research Computing Facility
Pavilion 2-25, Rochester, MN 55905
crhea at Mayo.EDU
Fax: (507) 266-4486 (507) 284-0587
More information about the Beowulf mailing list
