Mpich 1.2.3 first run problem

Jim Matthews beowulf at cfdlab.larc.nasa.gov
Mon Sep 16 15:10:09 PDT 2002


I have been seeing a very strange problem with MPICH 1.2.3 (and,
according to someone else who tested it, 1.2.4 as well).  The problem
occurs immediately following a reboot of the cluster nodes.  If I try
to run a job on, for example, 16 processors, the job hangs when mpirun
is invoked and never starts.  The workaround is to start with a
2-processor job, which always works, then run a 3-processor job, and
keep working my way up one processor at a time to 16.  Once that is
done, the job will run on all 16 processors (or however many) and can
be re-run, even after long periods of interruption, until the cluster
is rebooted, at which time the problem surfaces again.  I have several
sets of clusters and sub-clusters ranging from 8 to 48 nodes and
consisting of either Intel Pentium 4 or Alpha systems, running Red Hat
Linux 7.0 through 7.3.  All of these systems exhibit the same problem
with MPICH 1.2.3 after a reboot.  MPICH 1.2.1 and LAM/MPI do not
exhibit this behavior.  Has anyone experienced this problem, or does
anyone know what could be causing it?
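
For anyone who wants to reproduce the workaround, the ramp-up described
above could be scripted roughly as follows.  This is only a sketch: the
program path "./a.out", the 60-second timeout, and the helper names
warmup_commands and run_warmup are placeholders made up for
illustration, and it assumes mpirun is on the PATH and accepts the
usual "-np N" option.

```python
import subprocess


def warmup_commands(max_procs, program="./a.out", mpirun="mpirun"):
    """Build one mpirun command per process count, from 2 up to max_procs.

    The program path and mpirun name are placeholders, not taken from
    the original report.
    """
    if max_procs < 2:
        raise ValueError("max_procs must be at least 2")
    return [[mpirun, "-np", str(n), program] for n in range(2, max_procs + 1)]


def run_warmup(max_procs, program="./a.out", timeout=60):
    """Run the jobs in increasing size, stopping at the first failure.

    The reported failure mode is a hang, so each run is bounded by a
    timeout (an arbitrary value) instead of waiting forever.
    """
    for cmd in warmup_commands(max_procs, program):
        subprocess.run(cmd, check=True, timeout=timeout)


# Dry run: print the commands that a 4-processor warm-up would issue.
for cmd in warmup_commands(4):
    print(" ".join(cmd))
```

Calling run_warmup(16) after a reboot would step through the 2-, 3-,
..., 16-processor jobs in order; it does nothing to explain why the
first large job hangs.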

Any help is greatly appreciated.

Thanks,

 
--JIM

-- 

 -----------------------------------------------------------------------
 James W. Matthews - UNIX System Administration / Beowulf Cluster Design
 Raytheon Technical Services Company - NASA Langley Research Center
 MS 128 - 18E West Taylor Street - Hampton, VA 23681
 E-Mail: J.W.Matthews at LaRC.NASA.GOV - Phone: (757) 864-5259
 -----------------------------------------------------------------------




