MPICH-1.2.2.3 Problem
Gabriel J. Weinstock
gabriel.weinstock at dnamerican.com
Wed Oct 24 12:12:21 PDT 2001
I'm trying to get MPICH 1.2.2.3 running on a 4 node cluster of PIII 1 GHz
machines. The tstmachines program runs without error, and the rsh mechanism is
set up and functioning properly. LAM-MPI works out of the box, so we decided
to use that for a while, but we're going to need a production environment and
MPICH seemed more suitable.
Anyway, I compile the example `cpi.c' program, and do `mpirun -v -np 4
cpi'. Nothing happens for a few minutes, then I get a flurry of `Connection
failed for reason: : Connection timed out' messages, followed by
p1_10899: p4_error: Timeout in establishing connection to remote process: 0
p3_15707: p4_error: net_recv read: probable EOF on socket: 1
bm_list_4303: (378.120857) Listener: Unable to interrupt client pid=4302.
We had a similar problem about 2 months ago which led us to abandon this
implementation. There seem to be a number of people having this problem, but
no one, and I mean no one, seems to know the answer. Any help would be
greatly appreciated.
Thanks,
Gabe