Mpich 1.2.3 first run problem
Jim Matthews beowulf at cfdlab.larc.nasa.gov
Mon Sep 16 15:10:09 PDT 2002
I have been seeing a very strange problem with MPICH 1.2.3 (and, according to someone else who tested it, with 1.2.4 as well). The problem occurs immediately after the cluster nodes are rebooted. If I try to run a job on 16 processors, for example, the job hangs and never starts when I invoke mpirun. The workaround is to start with a 2-processor job, which always works, then run a 3-processor job, and so on, working my way up to 16 processors. Once that is done, the job will run on all 16 processors (or however many) and can be run and re-run, even after long interruptions, until the cluster is rebooted, at which point the problem surfaces again.

I have several clusters and sub-clusters ranging from 8 to 48 nodes, consisting of either Intel PIV or Alpha systems running Red Hat Linux 7.0 through 7.3. All of these systems exhibit the same problem with MPICH 1.2.3 after a reboot. MPICH 1.2.1 and LAM MPI do not exhibit this behavior.

Has anyone experienced this problem, or does anyone know what could be causing it? Any help is greatly appreciated.

Thanks,
--JIM

--
-----------------------------------------------------------------------
James W. Matthews - UNIX System Administration / Beowulf Cluster Design
Raytheon Technical Services Company - NASA Langley Research Center
MS 128 - 18E West Taylor Street - Hampton, VA 23681
E-Mail: J.W.Matthews at LaRC.NASA.GOV - Phone: (757) 864-5259
-----------------------------------------------------------------------
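The incremental warm-up described in the message (2 processes, then 3, and so on up to the target count) could be scripted as a stopgap after each reboot. This is a minimal Python sketch, not a fix for the underlying cause. It assumes `mpirun` is on the PATH and accepts the usual `-np` option. `./cpi` is a hypothetical stand-in for any small MPI test program, and the 120-second timeout is an arbitrary choice so that a hung start-up is reported instead of blocking forever.

```python
import subprocess


def warmup_sequence(target_np, program, start_np=2):
    """Build mpirun command lines for start_np, start_np + 1, ..., target_np."""
    if start_np < 1 or target_np < start_np:
        raise ValueError("need 1 <= start_np <= target_np")
    return [["mpirun", "-np", str(n), program]
            for n in range(start_np, target_np + 1)]


def warm_up(target_np, program, start_np=2, dry_run=True, timeout_s=120):
    """Print (dry run) or execute each warm-up step in increasing order.

    With dry_run=False, a step that exits non-zero raises CalledProcessError,
    and a step that hangs longer than timeout_s raises TimeoutExpired.
    """
    for cmd in warmup_sequence(target_np, program, start_np):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True, timeout=timeout_s)


if __name__ == "__main__":
    # "./cpi" is a hypothetical placeholder for a small MPI test program.
    warm_up(16, "./cpi")
```

Running it as-is only prints the 15 commands (`mpirun -np 2 ./cpi` through `mpirun -np 16 ./cpi`). Passing `dry_run=False` actually launches them, stopping at the first step that fails or hangs.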
More information about the Beowulf mailing list
