[Fwd: Problem using -p4pg and procgroup file]

Jeffery A. White j.a.white at larc.nasa.gov
Mon Sep 17 07:15:37 PDT 2001


Dear Group, 

  I recently sent a message to the mpich help address, and the
response I got back was rather unsatisfactory: basically, they told
me to RTFM. Considering that I had attached several pages of MPICH
debug output that (I think) demonstrate a problem, I was not
satisfied with their response. I would greatly appreciate any help
or suggestions anyone on the list can provide. For a complete
description of the problem and the MPICH debug information, please
see the attached file.

Thanks,

Jeff White
 
Jeffery A. White
email : j.a.white at larc.nasa.gov
Phone : (757) 864-6882 ; Fax : (757) 864-6243
URL   : http://hapb-www.larc.nasa.gov/~jawhite/
-------------- next part --------------

To whom it may concern,

I am trying to figure out how to use the -p4pg option to mpirun, and
I am running into difficulties.

  My cluster configuration is as follows:

node0 :
machine : Dual processor Supermicro Super 370DLE
cpu     : 1 GHz Pentium 3 
O.S.    : Redhat Linux 7.1
kernel  : 2.4.2-2smp
mpich   : 1.2.1

nodes 1->18 :
machine : Compaq xp1000
cpu     : 667 MHz DEC alpha 21264
O.S.    : Redhat Linux 7.0
kernel  : 2.4.2
mpich   : 1.2.1

nodes 19->34 :
machine : Microway Screamer
cpu     : 667 MHz DEC alpha 21164
O.S.    : Redhat Linux 7.0
kernel  : 2.4.2
mpich   : 1.2.1

The heterogeneous nature of the cluster has led me to migrate from
the -machinefile option to the -p4pg option. I have been trying to
run a 2-processor job, submitting the mpirun command from node0
(-nolocal is specified) and using either nodes 1 and 2 or nodes 2
and 3. With the -machinefile approach I can run on any homogeneous
combination of nodes. With the -p4pg approach, however, I have only
been able to run when the MPI master node is node1. As long as node1
is the master node, I can use any one of nodes 2 through 18 as the
2nd processor. The following 4 runs illustrate what works and what
does not (with the resulting error message). Runs 1, 2 and 3 worked;
run 4 failed.
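
For reference, my understanding of the ch_p4 procgroup file format
(based on my reading of the MPICH documentation, so treat the
annotation below as an assumption rather than gospel) is one line
per host:

<hostname> <nprocs> <full path to executable>

where the first line names the host the master process runs on, and
its <nprocs> field counts processes in addition to the master itself
(so 0 means just the master). For example,

node2 0 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
node3 1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver

should request the master on node2 plus one slave on node3.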

1) When submitting from node0 using the -machinefile option to run on
   nodes 1 and 2 using mpirun configured as:

mpirun -v -keep_pg -nolocal -np 2 -machinefile vulcan.hosts /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver

the machinefile vulcan.hosts contains:

node1
node2

the PIXXXX file created contains:

node1 0 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver
node2 1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver

and the -v option reports

running /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver on 2 LINUX ch_p4 processors
Created /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/PI10802

the /var/log/messages file on node0 contains :

no events during this time frame

the /var/log/messages file on node1 contains :

Sep 13 15:49:32 hyprwulf1 xinetd[21912]: START: shell pid=23013 from=192.168.47.31
Sep 13 15:49:32 hyprwulf1 pam_rhosts_auth[23013]: allowed to jawhite at hyprwulf-boot0.hapb as jawhite
Sep 13 15:49:32 hyprwulf1 PAM_unix[23013]: (rsh) session opened for user jawhite by (uid=0)
Sep 13 15:49:32 hyprwulf1 in.rshd[23014]: jawhite at hyprwulf-boot0.hapb as jawhite: cmd='/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver -p4pg /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/PI15564 -p4wd /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases'

the /var/log/messages file on node2 contains :

Sep 13 15:49:32 hyprwulf2 xinetd[13163]: START: shell pid=13490 from=192.168.47.32
Sep 13 15:49:32 hyprwulf2 pam_rhosts_auth[13490]: allowed to jawhite at hyprwulf-boot1.hapb as jawhite
Sep 13 15:49:32 hyprwulf2 PAM_unix[13490]: (rsh) session opened for user jawhite by (uid=0)
Sep 13 15:49:32 hyprwulf2 in.rshd[13491]: jawhite at hyprwulf-boot1.hapb as jawhite: cmd='/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver node1 34248 \-p4amslave'

and the program executes successfully

2) When submitting from node0 using the -p4pg option to run on
   nodes 1 and 2 using mpirun configured as:

mpirun -v -nolocal -p4pg vulcan.hosts /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver

the procgroup file vulcan.hosts contains:

node1 0 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
node2 1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver

and the -v option reports

running /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver on 1 LINUX ch_p4 processors

the /var/log/messages file on node0 contains :

no events during this time frame

the /var/log/messages file on node1 contains :

Sep 13 15:41:46 hyprwulf1 xinetd[21912]: START: shell pid=22978 from=192.168.47.31
Sep 13 15:41:46 hyprwulf1 pam_rhosts_auth[22978]: allowed to jawhite at hyprwulf-boot0.hapb as jawhite
Sep 13 15:41:46 hyprwulf1 PAM_unix[22978]: (rsh) session opened for user jawhite by (uid=0)
Sep 13 15:41:46 hyprwulf1 in.rshd[22979]: jawhite at hyprwulf-boot0.hapb as jawhite: cmd='/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver -p4pg /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/vulcan.hosts -p4wd /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases'

the /var/log/messages file on node2 contains :

Sep 13 15:41:46 hyprwulf2 xinetd[13163]: START: shell pid=13472 from=192.168.47.32
Sep 13 15:41:46 hyprwulf2 pam_rhosts_auth[13472]: allowed to jawhite at hyprwulf-boot1.hapb as jawhite
Sep 13 15:41:46 hyprwulf2 PAM_unix[13472]: (rsh) session opened for user jawhite by (uid=0)
Sep 13 15:41:46 hyprwulf2 in.rshd[13473]: jawhite at hyprwulf-boot1.hapb as jawhite: cmd='/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver node1 34240 \-p4amslave'

and the program executes successfully

3) When submitting from node0 using the -machinefile option to run on
   nodes 2 and 3 using mpirun configured as:

mpirun -v -keep_pg -nolocal -np 2 -machinefile vulcan.hosts /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver

the machinefile vulcan.hosts contains:

node2
node3

the PIXXXX file created contains:

node2 0 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver
node3 1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver

and the -v option reports

running /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver on 2 LINUX ch_p4 processors
Created /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/PI11592

the /var/log/messages file on node0 contains :

no events during this time frame

the /var/log/messages file on node1 contains :

no events during this time frame

the /var/log/messages file on node2 contains :

Sep 13 15:35:29 hyprwulf2 xinetd[13163]: START: shell pid=13451 from=192.168.47.31
Sep 13 15:35:29 hyprwulf2 pam_rhosts_auth[13451]: allowed to jawhite at hyprwulf-boot0.hapb as jawhite
Sep 13 15:35:29 hyprwulf2 PAM_unix[13451]: (rsh) session opened for user jawhite by (uid=0)
Sep 13 15:35:29 hyprwulf2 in.rshd[13452]: jawhite at hyprwulf-boot0.hapb as jawhite: cmd='/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver -p4pg /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/PI15167 -p4wd /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases'

the /var/log/messages file on node3 contains :

Sep 13 15:35:29 hyprwulf3 xinetd[11167]: START: shell pid=11435 from=192.168.47.33
Sep 13 15:35:29 hyprwulf3 pam_rhosts_auth[11435]: allowed to jawhite at hyprwulf-boot2.hapb as jawhite
Sep 13 15:35:29 hyprwulf3 PAM_unix[11435]: (rsh) session opened for user jawhite by (uid=0)
Sep 13 15:35:29 hyprwulf3 in.rshd[11436]: jawhite at hyprwulf-boot2.hapb as jawhite: cmd='/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver node2 33713 \-p4amslave'

and the program executes successfully

4) When submitting from node0 using the -p4pg option to run on
   nodes 2 and 3 using mpirun configured as:

mpirun -v -nolocal -p4pg vulcan.hosts /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver

the procgroup file vulcan.hosts contains:

node2 0 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
node3 1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver

and the -v option reports

running /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver on 1 LINUX ch_p4 processors

the /var/log/messages file on node0 contains :

no events during this time frame

the /var/log/messages file on node1 contains :

Sep 13 14:54:48 hyprwulf1 xinetd[21912]: START: shell pid=22917 from=192.168.47.31
Sep 13 14:54:48 hyprwulf1 pam_rhosts_auth[22917]: allowed to jawhite at hyprwulf-boot0.hapb as jawhite
Sep 13 14:54:48 hyprwulf1 PAM_unix[22917]: (rsh) session opened for user jawhite by (uid=0)
Sep 13 14:54:48 hyprwulf1 in.rshd[22918]: jawhite at hyprwulf-boot0.hapb as jawhite: cmd='/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver -p4pg /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/vulcan.hosts -p4wd /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases'

the /var/log/messages file on node2 contains :

no events during this time frame

the /var/log/messages file on node3 contains :

Sep 13 14:54:48 hyprwulf3 xinetd[11167]: START: shell pid=11395 from=192.168.47.32
Sep 13 14:54:48 hyprwulf3 pam_rhosts_auth[11395]: allowed to jawhite at hyprwulf-boot1.hapb as jawhite
Sep 13 14:54:48 hyprwulf3 PAM_unix[11395]: (rsh) session opened for user jawhite by (uid=0)
Sep 13 14:54:48 hyprwulf3 in.rshd[11396]: jawhite at hyprwulf-boot1.hapb as jawhite: cmd='/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver node2 34232 \-p4amslave'

and the following error message is generated

rm_10957:  p4_error: rm_start: net_conn_to_listener failed: 34133

It appears that in case 4, even though I requested that node2 and
node3 be used, the master process is being rsh'd to node1 instead
(the node1 log above shows the -p4pg command arriving there). The
log message from node3 indicates the slave expects to connect back
to node2 (partial proof that I really did request node2), but since
there is no master process on node2, an error occurs.
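
To spell that out: the slave command recorded in the node3 log,

/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver node2 34232 \-p4amslave

tells the slave (as I read the p4 startup protocol) to connect back
to the master's listener on node2 at port 34232. With the master
actually sitting on node1, nothing is listening on node2, and the
net_conn_to_listener error results.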

The output below is the stream from case 4 after invoking mpirun
with the -echo and -mpiversion options:
++ echo 'default_arch   = LINUX'
++ echo 'default_device = ch_p4'
++ echo 'machine             = ch_p4'
++ '[' 1 -le 5 ']'
++ arg=-mpiversion
++ shift
++ '[' -x /usr/local/pkgs/mpich_1.2.1/bin/mpirun.ch_p4.args ']'
++ device_knows_arg=0
++ . /usr/local/pkgs/mpich_1.2.1/bin/mpirun.ch_p4.args
++ '[' 0 '!=' 0 ']'
+++ echo -mpiversion
+++ sed s/%a//g
++ proginstance=-mpiversion
++ '[' '' = '' -a '' = '' -a '!' -x -mpiversion ']'
++ fake_progname=-mpiversion
++ '[' 1 -le 4 ']'
++ arg=-nolocal
++ shift
++ nolocal=1
++ '[' 1 -le 3 ']'
++ arg=-p4pg
++ shift
++ '[' -x /usr/local/pkgs/mpich_1.2.1/bin/mpirun.ch_p4.args ']'
++ device_knows_arg=0
++ . /usr/local/pkgs/mpich_1.2.1/bin/mpirun.ch_p4.args
+++ '[' 1 -gt 1 ']'
+++ p4pgfile=/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/vulcan.hosts
+++ shift
+++ leavePGFile=1
+++ device_knows_arg=1
++ '[' 1 '!=' 0 ']'
++ continue
++ '[' 1 -le 1 ']'
++ arg=/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
++ shift
++ '[' -x /usr/local/pkgs/mpich_1.2.1/bin/mpirun.ch_p4.args ']'
++ device_knows_arg=0
++ . /usr/local/pkgs/mpich_1.2.1/bin/mpirun.ch_p4.args
++ '[' 0 '!=' 0 ']'
+++ echo /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
+++ sed s/%a//g
++ proginstance=/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
++ '[' '' = '' -a -mpiversion = '' -a '!' -x /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver ']'
++ '[' '' = '' -a -x /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver ']'
++ progname=/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
++ '[' 1 -le 0 ']'
++ '[' 1 -le 0 ']'
++ '[' '' = '' -a /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver = '' ']'
++ '[' -n -mpiversion -a -n /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver ']'
++ echo 'Unrecognized argument -mpiversion ignored.'
++ larch=
++ '[' -z '' ']'
++ larch=LINUX
++ '[' -n 'sed -e s@/tmp_mnt/@/@g' ']'
+++ pwd
+++ sed -e s@/tmp_mnt/@/@g
++ PWDtest=/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases
++ '[' '!' -d /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases ']'
++ '[' -n /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases ']'
+++ echo /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases
+++ sed -e s@/tmp_mnt/@/@g
++ PWDtest2=/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases
++ /bin/rm -f /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/.mpirtmp16410 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/.mpirtmp16410
+++ eval 'echo test > /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/.mpirtmp16410'
++ '[' '!' -s /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/.mpirtmp16410 ']'
++ PWD=/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases
++ /bin/rm -f /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/.mpirtmp16410 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/.mpirtmp16410
++ '[' -n /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases ']'
++ PWD_TRIAL=/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases
+++ echo /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
+++ sed 's/\/.*//'
++ tail=
++ '[' '' = '' ']'
++ true
++ '[' '' = '' -a -x /usr/local/pkgs/mpich_1.2.1/bin/tarch ']'
+++ /usr/local/pkgs/mpich_1.2.1/bin/tarch
++ arch=LINUX
++ '[' LINUX = IRIX64 -a '(' LINUX = IRIX -o LINUX = IRIXN32 ')' ']'
++ archlist=LINUX
++ '[' ch_p4 = '' ']'
++ '[' ch_p4 = p4 -o ch_p4 = execer -o ch_p4 = sgi_mp -o ch_p4 = ch_p4 -o ch_p4 = ch_p4-2 -o ch_p4 = globus -o ch_p4 = globus ']'
++ '[' '' = '' ']'
++ MPI_HOST=
++ '[' LINUX = ipsc860 ']'
+++ hostname
++ MPI_HOST=hyprwulf00
++ '[' hyprwulf00 = '' ']'
++ '[' /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases '!=' '' ']'
+++ pwd
+++ sed -e s%/tmp_mnt/%/%g
++ PWD_TRIAL=/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases
++ '[' '!' -d /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases ']'
++ '[' 1 = 1 ']'
++ cnt=1
++ '[' 0 -gt 1 ']'
++ echo 'running /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver on 1 LINUX ch_p4 processors'
+ argsset=1
+ mpirun_version=
+ mpirun_version=/usr/local/pkgs/mpich_1.2.1/bin/mpirun.ch_p4
+ exitstat=1
+ '[' -n /usr/local/pkgs/mpich_1.2.1/bin/mpirun.ch_p4 ']'
+ '[' -x /usr/local/pkgs/mpich_1.2.1/bin/mpirun.ch_p4 ']'
+ . /usr/local/pkgs/mpich_1.2.1/bin/mpirun.ch_p4
++ exitstatus=1
++ '[' -z 1 ']'
++ '[' -n '' ']'
++ '[' -n '' ']'
++ '[' '' = shared ']'
++ MPI_MAX_CLUSTER_SIZE=1
++ . /usr/local/pkgs/mpich_1.2.1/bin/mpirun.pg
+++ '[' 1 = '' ']'
+++ '[' 0 = 0 ']'
+++ narch=1
+++ arch1=LINUX
+++ archlist1=LINUX
+++ archlocal=LINUX
+++ np1=1
+++ '[' 1 = 1 ']'
+++ procFound=0
+++ machinelist=
+++ archuselist=
+++ nprocuselist=
+++ curarch=1
+++ nolocalsave=1
+++ archlocal=LINUX
+++ '[' 1 -le 1 ']'
+++ eval 'arch=$arch1'
++++ arch=LINUX
+++ eval 'archlist=$archlist1'
++++ archlist=LINUX
+++ '[' -z LINUX ']'
+++ eval 'np=$np1'
++++ np=1
+++ '[' -z 1 ']'
+++ eval 'mFile=$machineFile1'
++++ mFile=
+++ '[' -n '' -a -r '' ']'
+++ '[' -z '' ']'
+++ '[' ch_p4 = ibmspx -a -x /usr/local/bin/getjid ']'
+++ machineDir=/usr/local/pkgs/mpich_1.2.1/share
+++ machineFile=/usr/local/pkgs/mpich_1.2.1/share/machines.LINUX
+++ '[' -r /usr/local/pkgs/mpich_1.2.1/share/machines.LINUX ']'
+++ break
+++ '[' -z /usr/local/pkgs/mpich_1.2.1/share/machines.LINUX -o '!' -s /usr/local/pkgs/mpich_1.2.1/share/machines.LINUX -o '!' -r /usr/local/pkgs/mpich_1.2.1/share/machines.LINUX ']'
++++ expr hyprwulf00 : '\([^\.]*\).*'
+++ MPI_HOSTLeader=hyprwulf00
+++ '[' '' = yes ']'
+++ '[' 1 = 0 -o 1 -gt 1 ']'
+++ '[' 1 -gt 1 -o 1 = 1 ']'
++++ cat /usr/local/pkgs/mpich_1.2.1/share/machines.LINUX
++++ sed -e '/^#/d' -e 's/#.*^//g'
++++ grep -v '^hyprwulf00\([ -\.:]\)'
++++ head -1
++++ tr '\012' ' '
+++ machineavail=mpi1 
+++ KeepHost=0
+++ loopcnt=0
+++ '[' -z 1 ']'
+++ '[' 1 = 0 -a 1 -gt 1 ']'
++++ expr 1 - 0
+++ nleft=1
+++ '[' 1 -lt 0 ']'
+++ '[' 0 -lt 1 ']'
+++ nfound=0
+++ nprocmachine=1
++++ expr mpi1 : '.*:\([0-9]*\)'
+++ ntest=
+++ '[' -n '' -a '' '!=' 0 ']'
++++ expr mpi1 : '\([^\.]*\).*'
+++ machineNameLeader=mpi1
+++ '[' 1 = 1 -o 0 = 1 -o '(' mpi1 '!=' hyprwulf00 -a mpi1 '!=' hyprwulf00 ')' ']'
+++ '[' 1 -gt 1 ']'
+++ machinelist= mpi1
+++ archuselist= LINUX
+++ nprocuselist= 1
++++ expr 0 + 1
+++ procFound=1
++++ expr 0 + 1
+++ nfound=1
++++ expr 1 - 1
+++ nleft=0
+++ '[' 1 = 1 ']'
+++ break
++++ expr 0 + 1
+++ loopcnt=1
+++ '[' 1 = 0 -a 1 -gt 1 ']'
+++ '[' 1 -lt 1 ']'
++++ expr 1 + 1
+++ curarch=2
+++ procFound=0
+++ nolocal=1
+++ machineFile=
+++ '[' 2 -le 1 ']'
+++ nolocal=1
+++ '[' 1 '!=' 1 ']'
+++ break
++ prognamemain=/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
++ '[' -z /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/vulcan.hosts ']'
++ /bin/sync
++ '[' '' = '' ']'
++ p4workdir=/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases
++ startpgm=/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver  -p4pg /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/vulcan.hosts -p4wd /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases
++ '[' '' '!=' '' ']'
++ MPIRUN_DEVICE=ch_p4
++ export MPIRUN_DEVICE
++ '[' 0 = 1 ']'
++ doitall=eval
++ '[' 1 = 1 ']'
++ '[' '' = /dev/null ']'
++ doitall=eval /usr/bin/rsh -n mpi1
++ eval /usr/bin/rsh -n mpi1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver -p4pg /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/vulcan.hosts -p4wd /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases
+++ /usr/bin/rsh -n mpi1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver -p4pg /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/vulcan.hosts -p4wd /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases

default_arch   = LINUX
default_device = ch_p4
machine      = ch_p4
Unrecognized argument -mpiversion ignored.
running /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver on 1 LINUX ch_p4 processors
rm_11548:  p4_error: rm_start: net_conn_to_listener failed: 34288
bm_list_23231:  p4_error: interrupt SIGINT: 2
p0_23230:  p4_error: interrupt SIGINT: 2
Broken pipe
P4 procgroup file is /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/vulcan.hosts.

The result of cat /usr/local/pkgs/mpich_1.2.1/share/machines.LINUX
is shown below. Note that the -echo trace above shows mpirun taking
mpi1 from this file (machinelist= mpi1) and rsh'ing the master there
(/usr/bin/rsh -n mpi1 ...), regardless of the first line of the
procgroup file:

mpi1
mpi2
mpi3
mpi4
mpi5
mpi6
mpi7
mpi8
mpi9
mpi10
mpi11
mpi12
mpi13
mpi14
mpi15
mpi16
mpi17
mpi18

however, our /etc/hosts file aliases these names to the node names:

mpi1 node1 
mpi2 node2 

so using the p4pg file vulcan.hosts containing:

node2 0 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
node3 1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver

or 

mpi2 0 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
mpi3 1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver

both produce the same result/error message.

Removing mpi1 from the machines.LINUX file seems to fix the problem
by shifting the master process to mpi2/node2, but I suspect that if
I requested nodes 3 and 4 the same error would occur again. I had
hoped that using a p4pg file would let me pick any node as the
master node.
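
If that is right, one workaround I have not yet tried (a sketch,
assuming the solver accepts the same -p4pg/-p4wd arguments when
started by hand that mpirun passes it in the logs above) would be to
bypass mpirun's host selection entirely and rsh the master onto the
intended node myself:

rsh -n node2 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver -p4pg /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/vulcan.hosts -p4wd /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases

with vulcan.hosts still listing node2 first; the master would then
really be on node2, and the slave on node3 should be able to connect
back to its listener.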

