mpich question
Jeffery A. White
j.a.white at larc.nasa.gov
Thu Sep 13 10:46:56 PDT 2001
Dear group,
I am trying to figure out how to use the -p4pg option in mpirun and I
am experiencing some difficulties.
My cluster configuration is as follows:
node0 :
machine : Dual processor Supermicro Super 370DLE
cpu : 1 GHz Pentium 3
O.S. : Red Hat Linux 7.1
kernel : 2.4.2-2smp
mpich : 1.2.1
nodes 1->18 :
machine : Compaq XP1000
cpu : 667 MHz DEC Alpha 21264
O.S. : Red Hat Linux 7.0
kernel : 2.4.2
mpich : 1.2.1
nodes 19->34 :
machine : Microway Screamer
cpu : 667 MHz DEC Alpha 21164
O.S. : Red Hat Linux 7.0
kernel : 2.4.2
mpich : 1.2.1
The heterogeneous nature of the cluster has led me to migrate from the
-machinefile option to the -p4pg option. I have been trying to get a 2
processor job to run, submitting the mpirun command from node0 (-nolocal
is specified) and using either nodes 1 and 2 or nodes 2 and 3. With the
-machinefile approach I am able to run on any homogeneous combination of
nodes. However, with the -p4pg approach I have not been able to run
unless my MPI master node is node1. As long as node1 is the MPI master
node, I can use any one of nodes 2 through 18 as the 2nd processor. The
following 4 runs illustrate what I have gotten to work as well as what
doesn't (and the resulting error message). Runs 1, 2 and 3 worked; run 4
failed.
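For reference, my understanding of the p4 procgroup file format (based on
the MPICH documentation; the paths below are placeholders rather than my
actual setup) is one entry per line:

<hostname> <#procs> <full path to executable>

where the first line names the host for the master process and its count
is the number of additional processes to start there (0 if the master is
the only process on that host). For example:

node1 0 /path/to/a.out
node2 1 /path/to/a.out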
1) When submitting from node0 using the -machinefile option to run on
nodes 1 and 2 using mpirun configured as:
mpirun -v -keep_pg -nolocal -np 2 -machinefile vulcan.hosts /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
the machinefile file vulcan.hosts contains:
node1
node2
the PIXXXX file created contains:
node1 0 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver
node2 1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver
and the -v option reports
running /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver on 2 LINUX ch_p4 processors
Created /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/PI10802
and the program executes successfully
2) When submitting from node0 using the -p4pg option to run on
nodes 1 and 2 using mpirun configured as:
mpirun -v -nolocal -p4pg vulcan.hosts /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
the p4pg file vulcan.hosts contains:
node1 0 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
node2 1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
and the -v option reports
running /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver on 1 LINUX ch_p4 processors
and the program executes successfully
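(A side note: with -p4pg the -v output above says "1 ... processors" even
though both processes start. My guess, which I have not verified, is that
the -v message just echoes the -np value, which defaults to 1 when only a
procgroup file is given, so something like

mpirun -v -nolocal -np 2 -p4pg vulcan.hosts /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver

would presumably make the message read "on 2 ... processors" without
changing what actually runs.)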
3) When submitting from node0 using the -machinefile option to run on
nodes 2 and 3 using mpirun configured as:
mpirun -v -keep_pg -nolocal -np 2 -machinefile vulcan.hosts /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
the machinefile file vulcan.hosts contains:
node2
node3
the PIXXXX file created contains:
node2 0 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver
node3 1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver
and the -v option reports
running /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver on 2 LINUX ch_p4 processors
Created /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/PI11592
and the program executes successfully
4) When submitting from node0 using the -p4pg option to run on
nodes 2 and 3 using mpirun configured as:
mpirun -v -nolocal -p4pg vulcan.hosts /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
the p4pg file vulcan.hosts contains:
node2 0 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver
node3 1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver
and the -v option reports
running /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver on 1 LINUX ch_p4 processors
and the following error message is generated:
rm_10957: p4_error: rm_start: net_conn_to_listener failed: 34133
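My reading of the error is that a process could not connect back to the
master's listener socket on node2, though I may be wrong about that. For
what it is worth, this is how I would sanity-check the remote shell path
between the nodes from node0 (assuming mpich was built to use rsh, which
I have not double-checked):

rsh node2 date
rsh node3 date
rsh node2 rsh node3 date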
Thanks for your help,
Jeffery A. White
email : j.a.white at larc.nasa.gov
Phone : (757) 864-6882 ; Fax : (757) 864-6243
URL : http://hapb-www.larc.nasa.gov/~jawhite/