MPICH-1.2.2.3 Problem

Gabriel J. Weinstock gabriel.weinstock at dnamerican.com
Wed Oct 24 12:12:21 PDT 2001


  I'm trying to get MPICH 1.2.2.3 running on a four-node cluster of PIII 1 GHz 
machines. The tstmachines program runs without error, and the rsh mechanism is 
set up and functioning properly. LAM-MPI works out of the box, so we decided 
to use that for a while, but we're going to need a production environment, and 
MPICH seemed more suitable.
  Anyway, I compile the example `cpi.c' program and run `mpirun -v -np 4 
cpi'. Nothing happens for a few minutes, then I get a flurry of `Connection 
failed for reason: : Connection timed out' messages, followed by:

p1_10899: p4_error: Timeout in establishing connection to remote process: 0
p3_15707: p4_error: net_recv read: probable EOF on socket: 1
bm_list_4303: (378.120857) Listener: Unable to interrupt client pid=4302.

  We had a similar problem about two months ago, which led us to abandon this 
implementation. There seem to be a number of people having this problem, but 
no one, and I mean no one, seems to know the answer. Any help would be 
greatly appreciated.
Thanks,
Gabe



