Mpich 1.2.3 first run problem
Felix Rauch rauch at inf.ethz.ch
Tue Sep 17 08:40:39 PDT 2002
On Mon, 16 Sep 2002, Jim Matthews wrote:

> I have been seeing a very strange problem with mpich 1.2.3 (and 1.2.4 as
> well from someone else who tested it). The problem occurs immediately
> following a reboot of the cluster nodes. What happens is that if I try
> to run a job on 16 processors, for example, the job will hang and never
> start when an mpirun is invoked. The solution is to start with a 2
> processor job, which will always work, and from there go to a 3
> processor job working my way up to 16 processors. Once that is done the
> job will run on all 16 processors (or however many) and continue to run
> and be re-run, with long periods of interruption, until the cluster is
> reboot at which time the problem will once again surface.

We had a similar problem once when we installed mpich to compare it
with Score. The problem was that mpich didn't work correctly when
there were two mpich jobs on different machines with identical
process identifiers (PIDs). Right after a reboot, all nodes hand out
nearly the same PIDs, so such collisions are likely.

To solve the problem we wrote a little script that logged in to all
the nodes and started a different number of small jobs on each node
(e.g. node_number * 100 "echo"s), so that the PID counters on the
nodes no longer matched.

- Felix
-- 
Felix Rauch                      | Email: rauch at inf.ethz.ch
Institute for Computer Systems   | Homepage: http://www.cs.inf.ethz.ch/~rauch/
ETH Zentrum / RZ H18             | Phone: +41 1 632 7489
CH - 8092 Zuerich / Switzerland  | Fax: +41 1 632 1307
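
The original script is not included in the post. The following is only a rough sketch of the idea described above, under several assumptions: the nodes are reachable with passwordless ssh (the original may well have used rsh), the host names `node01`, `node02`, ... are hypothetical, and the function names are made up for illustration. It builds, for node number n, a remote shell loop that runs n * 100 short-lived `/bin/echo` processes, so each node has used up a different number of PIDs before the first mpirun.

```python
import subprocess


def pid_offset_command(node_number, per_node=100):
    """Return a POSIX shell loop that spawns node_number * per_node processes.

    /bin/echo is used instead of the shell builtin so that every
    iteration forks a real process and consumes one PID on the node.
    """
    count = node_number * per_node
    return (
        f"i=0; while [ $i -lt {count} ]; "
        "do /bin/echo >/dev/null; i=$((i+1)); done"
    )


def build_ssh_commands(nodes, per_node=100):
    """Build one ssh argv list per node; nodes are numbered from 1."""
    return [
        ["ssh", host, pid_offset_command(number, per_node)]
        for number, host in enumerate(nodes, start=1)
    ]


def skew_pids(nodes, per_node=100):
    """Run the PID-offset loop on every node, one node after another."""
    for argv in build_ssh_commands(nodes, per_node):
        subprocess.run(argv, check=True)


# Example (hypothetical host names), to be run once after a reboot:
#   skew_pids([f"node{n:02d}" for n in range(1, 17)])
```

Running something like this once after each reboot, before the first MPI job, should leave the nodes with different PID counters. Whether an offset of 100 per node is enough depends on how many processes each node starts on its own.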
More information about the Beowulf mailing list
