Mpich 1.2.3 first run problem

Felix Rauch rauch at inf.ethz.ch
Tue Sep 17 08:40:39 PDT 2002


On Mon, 16 Sep 2002, Jim Matthews wrote:
> I have been seeing a very strange problem with mpich 1.2.3 (and 1.2.4 as
> well from someone else who tested it).  The problem occurs immediately
> following a reboot of the cluster nodes.  What happens is that if I try
> to run a job on 16 processors, for example, the job will hang and never
> start when an mpirun is invoked.  The solution is to start with a 2
> processor job, which will always work, and from there go to a 3
> processor job working my way up to 16 processors.  Once that is done the
> job will run on all 16 processors (or however many) and continue to run
> and be re-run, with long periods of interruption, until the cluster is
> rebooted, at which time the problem will once again surface.

We had a similar problem once when we installed mpich to compare it
with SCore. The problem was that mpich didn't work correctly when
two mpich jobs on different machines had identical process
identifiers (PIDs). To solve the problem we wrote a little script
that logged in to all the nodes and started a different number of
small jobs on each node (e.g. node_number * 100 "echo"s), so that
the PIDs on the nodes no longer matched.
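
A minimal sketch of such a script, in Python, might look like the
following. It is not the original script: the host names, the helper
names (pid_skew_command, build_ssh_commands, skew_pids), the use of
ssh, and the 1-based node numbering are all assumptions made for
illustration. It calls /bin/echo rather than plain echo because echo
is a builtin in most shells and would not start a new process, so it
would not use up a PID.

```python
import subprocess


def pid_skew_command(node_number, jobs_per_step=100):
    """Return a POSIX shell loop that starts node_number * jobs_per_step
    short-lived processes, so each node uses up a different number of PIDs."""
    count = node_number * jobs_per_step
    # /bin/echo is an external program, so every iteration forks a process.
    return (
        "i=0; while [ $i -lt %d ]; do /bin/echo >/dev/null; i=$((i+1)); done"
        % count
    )


def build_ssh_commands(hosts, jobs_per_step=100):
    """Build one ssh argument list per host; nodes are numbered from 1
    (an assumption) so that every node runs at least some jobs."""
    return [
        ["ssh", host, pid_skew_command(number, jobs_per_step)]
        for number, host in enumerate(hosts, start=1)
    ]


def skew_pids(hosts):
    """Run the loop on every host in turn (assumes passwordless ssh)."""
    for argv in build_ssh_commands(hosts):
        subprocess.run(argv, check=True)
```

Calling skew_pids(["node1", "node2", "node3"]) (hypothetical host
names) after a reboot would run 100, 200 and 300 throwaway processes
on the three nodes respectively, before the first mpirun.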

- Felix
-- 
Felix Rauch                      | Email: rauch at inf.ethz.ch
Institute for Computer Systems   | Homepage: http://www.cs.inf.ethz.ch/~rauch/
ETH Zentrum / RZ H18             | Phone: +41 1 632 7489
CH - 8092 Zuerich / Switzerland  | Fax:   +41 1 632 1307



