MPICH 1.2.3 first run problem
Jim Matthews
beowulf at cfdlab.larc.nasa.gov
Mon Sep 16 15:10:09 PDT 2002
I have been seeing a very strange problem with MPICH 1.2.3 (and, according
to someone else who tested it, 1.2.4 as well). The problem occurs
immediately after a reboot of the cluster nodes. If I try to run a job on,
for example, 16 processors, the job hangs and never starts when mpirun is
invoked. The workaround is to start with a 2 processor job, which always
works, then run a 3 processor job, and work my way up one processor at a
time to 16. Once that is done, the job will run on all 16 processors (or
however many) and can be re-run, even after long idle periods between
runs, until the cluster is rebooted, at which point the problem surfaces
again. I have several sets of clusters and sub-clusters ranging from 8 to
48 nodes and consisting of either Intel PIV or Alpha systems, running
Red Hat Linux 7.0 through 7.3. All of these systems exhibit the same
problem with MPICH 1.2.3 after a reboot. MPICH 1.2.1 and LAM MPI do not
exhibit this behavior. Has anyone experienced this problem, or does anyone
know what could be causing it?
Any help is greatly appreciated.
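
For reference, the ramp-up workaround described above could be scripted
roughly as follows. This is only a minimal sketch, not something taken from
the MPICH distribution: it assumes mpirun is on the PATH and accepts the
usual "-np N" argument, and the function names, the program path, and the
120 second timeout are all placeholders chosen for illustration.

```python
import subprocess


def ramp_up_commands(program, max_procs, start=2):
    """Build one mpirun command per process count, from start up to max_procs."""
    return [["mpirun", "-np", str(n), program]
            for n in range(start, max_procs + 1)]


def warm_up(program, max_procs, timeout_s=120):
    """Run the ramp-up commands in order; stop at the first failure or hang.

    timeout_s is an assumed limit after which a run is treated as hung.
    Returns True if every step completed, False otherwise.
    """
    for cmd in ramp_up_commands(program, max_procs):
        print("running:", " ".join(cmd))
        try:
            subprocess.run(cmd, check=True, timeout=timeout_s)
        except (subprocess.CalledProcessError,
                subprocess.TimeoutExpired,
                FileNotFoundError) as exc:
            print("stopped at", cmd[2], "processes:", exc)
            return False
    return True
```

Calling warm_up("./myjob", 16) once after a reboot, where ./myjob stands in
for any small MPI test program, would step through 2, 3, ... 16 processes
in the same order I have been doing by hand. Note that the timeout only
stops the local mpirun; processes already started on remote nodes may
still need to be cleaned up separately.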
Thanks,
--JIM
--
-----------------------------------------------------------------------
James W. Matthews - UNIX System Administration / Beowulf Cluster Design
Raytheon Technical Services Company - NASA Langley Research Center
MS 128 - 18E West Taylor Street - Hampton, VA 23681
E-Mail: J.W.Matthews at LaRC.NASA.GOV - Phone: (757) 864-5259
-----------------------------------------------------------------------