Need help setting up MPI on a cluster
Ron Choy
cly at MIT.EDU
Thu Feb 28 13:35:42 PST 2002
(These are two posts that I made to mpi-bugs at mcs.anl.gov and
comp.parallel.mpi. I got no reply, so I am trying my luck here ...)
(There are two questions: the first is about setting up ch_p4mpd, the
second is about serv_p4 in ch_p4. Solving either one of them is good
enough for me - startup with -nolocal is really slow right now!)
Q1.
I am running MPICH 1.2.3 on a cluster of 9 nodes, each with 2 Athlon
MP processors. I installed mpich with ch_p4mpd on the frontend, and
copied the binaries over to the compute nodes. The configure options I
used are
--with-device=ch_p4mpd --prefix=/usr/local/mpich-mpd -rsh=ssh
Then I set up the mpd ring by running mpd on the frontend, and then
running
mpd -h frontend-0 -p <the port I got> -b
on each compute node.
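
To make the ring setup concrete, here is a rough sketch of the per-node step as a shell function. It only prints the commands rather than running them. The node names compute-0-0 through compute-0-7, the mpd path under the --prefix above, and the function name ring_cmds are assumptions for illustration; the mpd flags are the ones quoted above.

```shell
#!/bin/sh
# Print (do not run) the command that joins each compute node to the mpd ring.
# Assumptions: nodes are named compute-0-0 .. compute-0-7, ssh needs no
# password, and mpd lives under the --prefix used at configure time.
MPD=/usr/local/mpich-mpd/bin/mpd

ring_cmds() {
    head_node=$1   # host where the first mpd was started
    port=$2        # port reported by that first mpd
    i=0
    while [ "$i" -le 7 ]; do
        echo "ssh compute-0-$i $MPD -h $head_node -p $port -b"
        i=$((i + 1))
    done
}
```

Something like `ring_cmds frontend-0 "$PORT" | sh` would then run them, with PORT set by hand to whatever the first mpd printed.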
Tests I tried:
tstmachine works fine.
mpdringsize gives me 9 (correct)
mpdringtest works
mpdtrace gives out something sensible. (the nodes form a ring)
A hello world type program runs fine (with net_recv errors at the end).
That program involves no MPI_Send or MPI_Recv.
But when I try the cpi program in examples, I get
[cly at frontend-0 cly]$ mpirun -np 2 ./cpi
Process 0 on frontend-0
Process 1 on compute-0-7
p1_26310: (2.019748) net_recv failed for fd = 12
p1_26310: p4_error: net_recv read, errno = : 111
This happens for any program that involves Send and Recv (Send, Recv,
Bcast, etc. never complete). Any insights? Anything I did wrong in
the setup?
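
For reference when reading the p4_error line: on Linux, errno 111 is ECONNREFUSED. A small lookup sketch for the socket-related errno values most likely to show up here follows. The function name p4_errno_name is made up for illustration, and the numbers are the Linux values only; other systems number them differently.

```shell
#!/bin/sh
# Map a few Linux errno numbers to their symbolic names.
# Linux-specific values; not valid on other operating systems.
p4_errno_name() {
    case "$1" in
        104) echo ECONNRESET ;;
        110) echo ETIMEDOUT ;;
        111) echo ECONNREFUSED ;;
        113) echo EHOSTUNREACH ;;
        *)   echo "unknown ($1)" ;;
    esac
}

p4_errno_name 111
```

If that reading is right, a connection attempt between the two processes was refused, though I have not confirmed which side refused it.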
-----------------------------------------------------------------------------------------
Q2.
Failing on ch_p4mpd (see my previous email), I am trying to do serv_p4
on ch_p4. But I am running into this problem.
[root at frontend-0 sbin]# ./chp4_servs
starting /usr/local/mpich/bin/serv_p4 on compute-0-0 with 1234
<snip>
[cly at frontend-0 sbin]$ ./chkserv
Bad message from compute-0-0: :Password
:
<snip>
I configured with -rsh=ssh, and my system is set up so that ssh requires
no password (mpirun works perfectly without serv_p4).
I tried searching on the web but I can't find any useful information.
And I can't seem to find any documentation on serv_p4, on its options, or
on why it's asking for a password (searching for "serv_p4 password" on
Google doesn't yield anything sensible).
Any ideas?
More information about the Beowulf
mailing list