mpich bug?
Jeffery A. White
j.a.white@larc.nasa.gov
Thu Sep 13 13:25:02 PDT 2001
Dear list,
My previous message contained some incorrect information. My apologies.
I have investigated further, looking in the /var/log/messages files,
and have found the following.
My cluster configuration is as follows:
node0 :
machine : Dual processor Supermicro Super 370DLE
cpu : 1 GHz Pentium 3
O.S. : Redhat Linux 7.1
kernel : 2.4.2-2smp
mpich : 1.2.1
nodes 1->18 :
machine : Compaq xp1000
cpu : 667 MHz DEC alpha 21264
O.S. : Redhat Linux 7.0
kernel : 2.4.2
mpich : 1.2.1
nodes 19->34 :
machine : Microway Screamer
cpu : 667 MHz DEC alpha 21164
O.S. : Redhat Linux 7.0
kernel : 2.4.2
mpich : 1.2.1
The heterogeneous nature of the cluster has led me to migrate from
the -machinefile option to the -p4pg option (a procgroup file lets
each host run its own executable, which a machine file cannot; see
the sketch after this paragraph). I have been trying to get a 2
processor job to run, submitting the mpirun command from node0
(-nolocal is specified) and using either nodes 1 and 2 or nodes 2
and 3. With the -machinefile approach I am able to run on any
homogeneous combination of nodes. However, with the -p4pg approach
I have not been able to run unless my mpi master node is node1. As
long as node1 is the mpi master node, I can use any one of nodes 2
through 18 as the 2nd processor. The following 4 runs illustrate
what I have gotten to work as well as what doesn't (and the
subsequent error message). Runs 1, 2 and 3 worked; run 4 failed.
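For reference, the kind of procgroup file I am ultimately after
would, as I understand the ch_p4 procgroup format, look something
like this (the DEC_21164 path is hypothetical; all of the runs below
use only the 21264 build):
node1 0 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
node19 1 /home0/jawhite/Vulcan/DEC_21164/Ver_4.3/Executable/VULCAN_solver
Each line gives a host, a process count and the full path of the
executable to run on that host; the first line names the host where
the master runs, and its count excludes the master itself (hence
the 0).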
1) When submitting from node0 using the -machinefile option to run on
nodes 1 and 2 using mpirun configured as:
mpirun -v -keep_pg -nolocal -np 2 -machinefile vulcan.hosts
/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver
the machine file vulcan.hosts contains:
node1
node2
the PIXXXX file created contains:
node1 0
/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver
node2 1
/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver
and the -v option reports
running
/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver on 2
LINUX ch_p4 processors
Created /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/PI10802
the /var/log/messages file on node0 contains :
no events during this time frame
the /var/log/messages file on node1 contains :
Sep 13 15:49:32 hyprwulf1 xinetd[21912]: START: shell pid=23013
from=192.168.47.31
Sep 13 15:49:32 hyprwulf1 pam_rhosts_auth[23013]: allowed to
jawhite@hyprwulf-boot0.hapb as jawhite
Sep 13 15:49:32 hyprwulf1 PAM_unix[23013]: (rsh) session opened for user
jawhite by (uid=0)
Sep 13 15:49:32 hyprwulf1 in.rshd[23014]: jawhite@hyprwulf-boot0.hapb as
jawhite:
cmd='/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver
-p4pg /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/PI15564 -p4wd
/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases'
the /var/log/messages file on node2 contains :
Sep 13 15:49:32 hyprwulf2 xinetd[13163]: START: shell pid=13490
from=192.168.47.32
Sep 13 15:49:32 hyprwulf2 pam_rhosts_auth[13490]: allowed to
jawhite@hyprwulf-boot1.hapb as jawhite
Sep 13 15:49:32 hyprwulf2 PAM_unix[13490]: (rsh) session opened for user
jawhite by (uid=0)
Sep 13 15:49:32 hyprwulf2 in.rshd[13491]: jawhite@hyprwulf-boot1.hapb as
jawhite:
cmd='/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver
node1 34248 \-p4amslave'
and the program executes successfully
2) When submitting from node0 using the -p4pg option to run on
nodes 1 and 2 using mpirun configured as:
mpirun -v -nolocal -p4pg vulcan.hosts
/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
the p4pg file vulcan.hosts contains:
node1 0 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
node2 1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
and the -v option reports
running /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
on 1 LINUX ch_p4 processors
(I assume the '1' here is just mpirun's default -np value, since with
-p4pg the process count is taken from the procgroup file rather than
from -np)
the /var/log/messages file on node0 contains :
no events during this time frame
the /var/log/messages file on node1 contains :
Sep 13 15:41:46 hyprwulf1 xinetd[21912]: START: shell pid=22978
from=192.168.47.31
Sep 13 15:41:46 hyprwulf1 pam_rhosts_auth[22978]: allowed to
jawhite@hyprwulf-boot0.hapb as jawhite
Sep 13 15:41:46 hyprwulf1 PAM_unix[22978]: (rsh) session opened for user
jawhite by (uid=0)
Sep 13 15:41:46 hyprwulf1 in.rshd[22979]: jawhite@hyprwulf-boot0.hapb as
jawhite:
cmd='/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
-p4pg /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/vulcan.hosts
-p4wd /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases'
the /var/log/messages file on node2 contains :
Sep 13 15:41:46 hyprwulf2 xinetd[13163]: START: shell pid=13472
from=192.168.47.32
Sep 13 15:41:46 hyprwulf2 pam_rhosts_auth[13472]: allowed to
jawhite@hyprwulf-boot1.hapb as jawhite
Sep 13 15:41:46 hyprwulf2 PAM_unix[13472]: (rsh) session opened for user
jawhite by (uid=0)
Sep 13 15:41:46 hyprwulf2 in.rshd[13473]: jawhite@hyprwulf-boot1.hapb as
jawhite:
cmd='/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
node1 34240 \-p4amslave'
and the program executes successfully
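As far as I can tell from the from= addresses and cmd= strings, the
startup chain in this working case is:
node0 (mpirun) --rsh--> node1 : VULCAN_solver -p4pg vulcan.hosts ... (master)
node1 (master) --rsh--> node2 : VULCAN_solver node1 34240 \-p4amslave (slave)
i.e. the master reads the procgroup file and rsh's the slave itself,
passing it the host and listener port (node1, 34240) to connect back
to.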
3) When submitting from node0 using the -machinefile option to run on
nodes 2 and 3 using mpirun configured as:
mpirun -v -keep_pg -nolocal -np 2 -machinefile vulcan.hosts
/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
the machine file vulcan.hosts contains:
node2
node3
the PIXXXX file created contains:
node2 0
/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver
node3 1
/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver
and the -v option reports
running
/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver on 2
LINUX ch_p4 processors
Created /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/PI11592
the /var/log/messages file on node0 contains :
no events during this time frame
the /var/log/messages file on node1 contains :
no events during this time frame
the /var/log/messages file on node2 contains :
Sep 13 15:35:29 hyprwulf2 xinetd[13163]: START: shell pid=13451
from=192.168.47.31
Sep 13 15:35:29 hyprwulf2 pam_rhosts_auth[13451]: allowed to
jawhite@hyprwulf-boot0.hapb as jawhite
Sep 13 15:35:29 hyprwulf2 PAM_unix[13451]: (rsh) session opened for user
jawhite by (uid=0)
Sep 13 15:35:29 hyprwulf2 in.rshd[13452]: jawhite@hyprwulf-boot0.hapb as
jawhite:
cmd='/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver
-p4pg /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/PI15167 -p4wd
/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases'
the /var/log/messages file on node3 contains :
Sep 13 15:35:29 hyprwulf3 xinetd[11167]: START: shell pid=11435
from=192.168.47.33
Sep 13 15:35:29 hyprwulf3 pam_rhosts_auth[11435]: allowed to
jawhite@hyprwulf-boot2.hapb as jawhite
Sep 13 15:35:29 hyprwulf3 PAM_unix[11435]: (rsh) session opened for user
jawhite by (uid=0)
Sep 13 15:35:29 hyprwulf3 in.rshd[11436]: jawhite@hyprwulf-boot2.hapb as
jawhite:
cmd='/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver
node2 33713 \-p4amslave'
and the program executes successfully
4) When submitting from node0 using the -p4pg option to run on
nodes 2 and 3 using mpirun configured as:
mpirun -v -nolocal -p4pg vulcan.hosts
/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
the p4pg file vulcan.hosts contains:
node2 0 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
node3 1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
and the -v option reports
running /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
on 1 LINUX ch_p4 processors
the /var/log/messages file on node0 contains :
no events during this time frame
the /var/log/messages file on node1 contains :
Sep 13 14:54:48 hyprwulf1 xinetd[21912]: START: shell pid=22917
from=192.168.47.31
Sep 13 14:54:48 hyprwulf1 pam_rhosts_auth[22917]: allowed to
jawhite@hyprwulf-boot0.hapb as jawhite
Sep 13 14:54:48 hyprwulf1 PAM_unix[22917]: (rsh) session opened for user
jawhite by (uid=0)
Sep 13 14:54:48 hyprwulf1 in.rshd[22918]: jawhite@hyprwulf-boot0.hapb as
jawhite:
cmd='/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
-p4pg /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/vulcan.hosts
-p4wd /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases'
the /var/log/messages file on node2 contains :
no events during this time frame
the /var/log/messages file on node3 contains :
Sep 13 14:54:48 hyprwulf3 xinetd[11167]: START: shell pid=11395
from=192.168.47.32
Sep 13 14:54:48 hyprwulf3 pam_rhosts_auth[11395]: allowed to
jawhite@hyprwulf-boot1.hapb as jawhite
Sep 13 14:54:48 hyprwulf3 PAM_unix[11395]: (rsh) session opened for user
jawhite by (uid=0)
Sep 13 14:54:48 hyprwulf3 in.rshd[11396]: jawhite@hyprwulf-boot1.hapb as
jawhite:
cmd='/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
node2 34232 \-p4amslave'
and the following error message is generated
rm_10957: p4_error: rm_start: net_conn_to_listener failed: 34133
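For comparison, the chain in this failing case appears to be:
node0 (mpirun) --rsh--> node1 : VULCAN_solver -p4pg vulcan.hosts ... (master, on the wrong node)
node1 (master) --rsh--> node3 : VULCAN_solver node2 34232 \-p4amslave (slave)
so the slave on node3 tries to connect back to a listener on node2,
where nothing is running, and net_conn_to_listener fails.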
It appears that in case 4, even though I have requested that node2
and node3 be used, a process is being rsh'd to node1 instead (is
mpirun perhaps taking the startup host from somewhere other than the
procgroup file when -nolocal is given?). The log message from node3
indicates that it expects to connect to node2 (partial proof that I
really did request node2), but since there is no process on node2,
an error occurs.
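In case it helps, I believe the startup command could be issued by
hand (this is just the cmd= string from the logs, aimed at node2
directly); if this works, it would suggest the problem is purely in
mpirun's choice of startup host:
rsh node2 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver \
  -p4pg /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/vulcan.hosts \
  -p4wd /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases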
Is this an mpich bug, or am I trying to use mpich incorrectly?
Thanks for any and all help!
Jeff
--
Jeffery A. White
email : j.a.white@larc.nasa.gov
Phone : (757) 864-6882 ; Fax : (757) 864-6243
URL : http://hapb-www.larc.nasa.gov/~jawhite/