[Beowulf] job runs with mpirun on a node but not if submitted via Torque.
Rahul Nabar
rpnabar at gmail.com
Tue Mar 31 15:54:55 PDT 2009
I have a strange OpenMPI/Torque problem while trying to run a job on our
Opteron-SC-1435 based cluster. Each node has 8 CPUs.
If I log in to a node and run the job directly like this, it works:
mpirun -np 6 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}
If I submit the same job through PBS/Torque, it starts running but the
individual processes keep crashing:
mpirun -np 6 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}
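For reference, here is a stripped-down sketch of the kind of submission
script I am using (the job name and resource request below are
placeholders, not my exact settings):

#!/bin/bash
#PBS -N dacapo_test
#PBS -l nodes=1:ppn=6
#PBS -j oe

cd $PBS_O_WORKDIR
mpirun -np 6 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}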
I know that a --hostfile argument is not needed for Torque-launched
OpenMPI jobs; recent OpenMPI builds take the node list straight from
Torque.
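Is it worth double-checking that my OpenMPI build actually has the
Torque (tm) launcher compiled in? I assume something like this would
show it, listing tm entries among the MCA components:

ompi_info | grep tm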
I also tried listing the hosts explicitly:
mpirun -np 6 --host node17,node17,node17,node17,node17,node17 \
    ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}
It still does not work.
What could be going wrong? Are there other things I need to worry
about when PBS steps in? Any tips?
${DACAPOEXE_PAR} refers to a Fortran executable for the computational
chemistry code DACAPO.
What's the difference between running a job on a node via mpirun
directly and submitting it via Torque? Shouldn't both be transparent to
the Fortran code? I am assuming I don't have to dig into the Fortran
source.
Any debug tips?
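One thing I was thinking of trying: adding something like this to the
top of the job script, to see what node list and limits mpirun actually
gets under Torque (assuming OpenMPI 1.3, whose mpirun has the
--display-allocation and --display-map options):

echo "Nodefile: $PBS_NODEFILE"
cat $PBS_NODEFILE
ulimit -a
mpirun -np 6 --display-allocation --display-map \
    ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}

Interactive shells and Torque jobs can end up with different
environments and ulimits (stack size in particular can bite Fortran
codes), so that is one difference I would like to rule out.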
Thanks!
--
Rahul