Mpich 1.2.3 first run problem
Felix Rauch
rauch at inf.ethz.ch
Tue Sep 17 08:40:39 PDT 2002
On Mon, 16 Sep 2002, Jim Matthews wrote:
> I have been seeing a very strange problem with mpich 1.2.3 (and 1.2.4 as
> well from someone else who tested it). The problem occurs immediately
> following a reboot of the cluster nodes. What happens is that if I try
> to run a job on 16 processors, for example, the job will hang and never
> start when an mpirun is invoked. The solution is to start with a 2
> processor job, which will always work, and from there go to a 3
> processor job working my way up to 16 processors. Once that is done the
> job will run on all 16 processors (or however many) and continue to run
> and be re-run, with long periods of interruption, until the cluster is
> rebooted, at which time the problem will once again surface.
We had a similar problem once when we installed mpich to compare it
with Score. The problem was that mpich didn't work correctly when
two mpich jobs on different machines had identical process
identifiers (PIDs). To solve the problem we wrote a little script
that logged in to all the nodes and started a different number of
small jobs on each node (e.g. node_number * 100 "echo"s), so that
the PIDs handed out afterwards differed from node to node.
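
A minimal sketch of such a script, not the original one. It assumes
passwordless ssh to every node, a POSIX shell on the remote side, and
hypothetical host names node01..node16; the helper names (spawn_count,
remote_command, build_ssh_argv, skew_pids) are made up for illustration.
The idea is only to burn a different number of PIDs on each node, which
presumably helps because identically configured nodes tend to hand out
nearly the same PIDs right after a reboot. It does not guarantee that
PIDs never collide later.

```python
import shlex
import subprocess

# Hypothetical node names; replace with the cluster's real host list.
NODES = [f"node{i:02d}" for i in range(1, 17)]

# The "node_number * 100" factor mentioned in the text above.
SPAWNS_PER_NODE_NUMBER = 100


def spawn_count(node_number, step=SPAWNS_PER_NODE_NUMBER):
    """Number of throwaway processes to start on the given node."""
    return node_number * step


def remote_command(count):
    """Shell loop that runs /bin/echo `count` times on the remote node.

    /bin/echo is an external program, so every call forks a new process
    and uses up one PID. The shell builtin `echo` would not do that.
    """
    return (
        f"i=0; while [ $i -lt {count} ]; do "
        "/bin/echo >/dev/null; i=$((i+1)); done"
    )


def build_ssh_argv(node, node_number):
    """Argument list for running the PID-burning loop on one node."""
    return ["ssh", node, remote_command(spawn_count(node_number))]


def skew_pids(nodes, dry_run=True):
    """Run (or, with dry_run, just print) the loop on every node."""
    for number, node in enumerate(nodes, start=1):
        argv = build_ssh_argv(node, number)
        if dry_run:
            print(shlex.join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Print the commands first; switch to dry_run=False to execute them.
    skew_pids(NODES, dry_run=True)
```

Run once on the head node after a reboot and before the first mpirun.
On a cluster that uses rsh instead of ssh, the first element of the
argument list would have to change accordingly.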
- Felix
--
Felix Rauch | Email: rauch at inf.ethz.ch
Institute for Computer Systems | Homepage: http://www.cs.inf.ethz.ch/~rauch/
ETH Zentrum / RZ H18 | Phone: +41 1 632 7489
CH - 8092 Zuerich / Switzerland | Fax: +41 1 632 1307