Need help setting up MPI on a cluster

Ron Choy cly at MIT.EDU
Thu Feb 28 13:35:42 PST 2002


(These are two posts that I made to mpi-bugs at mcs.anl.gov and 
comp.parallel.mpi.  I got no reply, so I am trying my luck here ...)
(There are two questions: the first is about setting up ch_p4mpd, the second 
is about serv_p4 in ch_p4.  Solving either one is good enough for me; 
startup with nolocal is really slow right now!)
Q1.

I am running MPICH 1.2.3 on a cluster of 9 nodes, each with 2 Athlon
MP processors.  I installed mpich with ch_p4mpd on the frontend, and copied
the binaries over to the compute nodes.  The configure options I used are

--with-device=ch_p4mpd --prefix=/usr/local/mpich-mpd -rsh=ssh

Then I set up the mpd ring by running mpd on the frontend, and then
running
mpd -h frontend-0 -p <the port I got> -b

on each compute node.
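
The per-node step above can also be scripted.  What follows is only a
minimal sketch, not what I actually ran: the function name start_ring and
the DRY_RUN variable are made up for illustration, and only the mpd
options -h, -p and -b come from the command above.  It assumes mpd is on
the PATH of every node and that ssh logs in without a password.

```shell
#!/bin/sh
# start_ring HOST PORT NODE...
# Join each NODE to the mpd ring whose first mpd runs on HOST at PORT.
# start_ring and DRY_RUN are hypothetical names used for this sketch.
# With DRY_RUN=1 the commands are printed instead of executed.
start_ring() {
    ring_host=$1
    ring_port=$2
    shift 2
    for node in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ssh $node mpd -h $ring_host -p $ring_port -b"
        else
            # -n stops ssh from reading stdin inside the loop
            ssh -n "$node" mpd -h "$ring_host" -p "$ring_port" -b ||
                echo "failed to start mpd on $node" >&2
        fi
    done
}
```

It would be called as start_ring frontend-0 PORT followed by the list of
compute node names, with PORT replaced by the port obtained from the
frontend mpd.  Running it once with DRY_RUN=1 shows the commands before
anything is started.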

Tests I tried:
tstmachine works fine.
mpdringsize gives me 9 (correct).
mpdringtest works.
mpdtrace gives sensible output (the nodes form a ring).
A hello-world type program runs fine (with net_recv errors at the end).
That program makes no MPI_Send or MPI_Recv calls.


But when I try the cpi program in examples, I get:
[cly at frontend-0 cly]$ mpirun -np 2 ./cpi
Process 0 on frontend-0
Process 1 on compute-0-7
p1_26310: (2.019748) net_recv failed for fd = 12
p1_26310:  p4_error: net_recv read, errno = : 111

This happens for any program that involves Send and Recv (Send, Recv,
Bcast, etc. never complete).  Any insights?  Did I do anything wrong in
the setup?
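
A note for reading the log above: on Linux x86, errno 111 is ECONNREFUSED
(connection refused), which normally means a TCP connection attempt was
refused by the other end.  Below is a small lookup for the socket errnos
that seem most likely to appear in p4_error lines.  It is a sketch only:
the function name errno_name is hypothetical, and the numbers are the
Linux x86 values, which differ on some other platforms.

```shell
#!/bin/sh
# errno_name N
# Print the symbolic name of a few socket-related errno values.
# Values are those used by Linux on x86; errno_name is a hypothetical helper.
errno_name() {
    case "$1" in
        104) echo "ECONNRESET (connection reset by peer)" ;;
        110) echo "ETIMEDOUT (connection timed out)" ;;
        111) echo "ECONNREFUSED (connection refused)" ;;
        113) echo "EHOSTUNREACH (no route to host)" ;;
        *)   echo "unknown errno $1" ;;
    esac
}
```

For the failure above, errno_name 111 prints the ECONNREFUSED line.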

-----------------------------------------------------------------------------------------
Q2.

Having failed with ch_p4mpd (see my previous email), I am trying to use
serv_p4 with ch_p4, but I am running into this problem:


[root at frontend-0 sbin]# ./chp4_servs
starting /usr/local/mpich/bin/serv_p4 on compute-0-0 with 1234
<snip>


[cly at frontend-0 sbin]$ ./chkserv
Bad message from compute-0-0: :Password
:
<snip>



I configured with -rsh=ssh, and my system is set up so that ssh requires
no password (mpirun works perfectly without serv_p4).
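
One way to double-check that ssh never prompts is to run it with
BatchMode=yes, which makes ssh fail instead of asking for a password.
This is a sketch under assumptions: check_ssh and DRY_RUN are names made
up for illustration, and "true" is just a harmless remote command.

```shell
#!/bin/sh
# check_ssh NODE...
# Report, for each NODE, whether ssh can log in without any prompt.
# check_ssh and DRY_RUN are hypothetical names used for this sketch.
# With DRY_RUN=1 the commands are printed instead of executed.
check_ssh() {
    for node in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ssh -n -o BatchMode=yes $node true"
        elif ssh -n -o BatchMode=yes "$node" true; then
            echo "$node: ok"
        else
            echo "$node: ssh failed or wanted a password"
        fi
    done
}
```

Calling check_ssh with the compute node names, as the same user that runs
chkserv, prints one line per node.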


I tried searching the web but I can't find any useful information.


And I can't seem to find any documentation on serv_p4, on its options, or
on why it's asking for a password (searching for "serv_p4 password" on
Google doesn't yield anything sensible).


Any ideas?





