Mpich 1.2.3 first run problem


William Gropp gropp at mcs.anl.gov
Tue Sep 17 11:04:41 PDT 2002


At 05:40 PM 9/17/2002 +0200, Felix Rauch wrote:
>On Mon, 16 Sep 2002, Jim Matthews wrote:
> > I have been seeing a very strange problem with mpich 1.2.3 (and 1.2.4 as
> > well from someone else who tested it).  The problem occurs immediately
> > following a reboot of the cluster nodes.  What happens is that if I try
> > to run a job on 16 processors, for example, the job will hang and never
> > start when an mpirun is invoked.  The solution is to start with a 2
> > processor job, which will always work, and from there go to a 3
> > processor job working my way up to 16 processors.  Once that is done the
> > job will run on all 16 processors (or however many) and continue to run
> > and be re-run, with long periods of interruption, until the cluster is
> > rebooted, at which time the problem will once again surface.
>
>We had a similar problem once when we installed mpich to compare it
>with Score. The problem was that mpich didn't work correctly when
>there were two mpich-jobs on different machines but with identical
>process identifiers (PIDs). To solve the problem we wrote a little
>script that logged in to all the nodes and started a different
>number of small jobs on the nodes (e.g. node_number * 100 "echo"s).

Thanks!  That points us to the code that must be broken.  We'll have a fix 
in a few days.

Bill
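
For readers who want to try the workaround Felix describes above, here is a minimal sketch of such a script. It is not the script his group used. The node names, the use of ssh as the remote shell, and all function names are assumptions made for illustration. The idea is to start node_number * 100 short-lived processes on each node so that the nodes' PID counters drift apart after a reboot. This only makes identical PIDs across nodes less likely; it does not guarantee they never collide.

```python
import shlex
import subprocess


def pid_burn_command(node_number, step=100):
    """Return a shell snippet that runs node_number * step short-lived processes."""
    count = node_number * step
    # /bin/echo is an external program, so every call forks a new process
    # and uses up one PID (the shell builtin `echo` would not).
    return (
        f"i=0; while [ $i -lt {count} ]; do "
        "/bin/echo >/dev/null; i=$((i+1)); done"
    )


def build_commands(nodes, step=100, remote_shell="ssh"):
    """Pair each node with the argv list that runs its PID-burning loop.

    Nodes are numbered from 1 in list order; `remote_shell` is an
    assumption (the original post does not say how the script logged in).
    """
    return [
        (node, [remote_shell, node, pid_burn_command(index, step)])
        for index, node in enumerate(nodes, start=1)
    ]


def desynchronize_pids(nodes, step=100, remote_shell="ssh", dry_run=True):
    """Print the commands (dry run) or execute them on each node in turn."""
    for node, argv in build_commands(nodes, step, remote_shell):
        if dry_run:
            print(shlex.join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical node names; replace with the cluster's own host list.
    desynchronize_pids(["node01", "node02", "node03"])
```

Run as written, the script only prints the commands it would execute. Passing dry_run=False runs them, which requires password-less login to every node.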



