[Beowulf] job runs with mpirun on a node but not if submitted via Torque.

Rahul Nabar rpnabar at gmail.com
Tue Mar 31 15:54:55 PDT 2009


I have a strange OpenMPI/Torque problem while trying to run a job on our
Opteron SC1435-based cluster:

Each node has 8 CPUs.

If I go to a node and run the job directly like this, it works:

mpirun -np 6 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}

If I submit the same job through PBS/Torque, it starts running but the
individual processes keep crashing:

mpirun -np 6 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}

I know that the --hostfile option is not needed when OpenMPI is built
with Torque support, since mpirun then picks up the node list from
Torque itself.
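
For context, my submission script is essentially the following (a
sketch; the job name, resource line, and walltime are placeholders
rather than our exact values):

#!/bin/bash
#PBS -N dacapo
#PBS -l nodes=1:ppn=6
#PBS -l walltime=01:00:00
# Run from the directory the job was submitted from.
cd $PBS_O_WORKDIR
# With a Torque-aware (tm) OpenMPI, mpirun discovers the allocated
# slots on its own, so no --hostfile is passed.
mpirun -np 6 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}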

I also tried listing the hosts explicitly:

mpirun -np 6 --host node17,node17,node17,node17,node17,node17 \
    ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}

Still does not work.

What could be going wrong? Are there other things I need to worry
about when PBS steps in? Any tips?
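
One check that comes to mind (an assumption on my part; I have not
verified our build yet) is whether our OpenMPI was actually compiled
with Torque (tm) support, and what node list Torque hands the job:

# Should list tm components (plm/ras) if built with Torque support:
ompi_info | grep tm

# Inside a running job: one hostname per allocated slot.
cat $PBS_NODEFILE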

The ${DACAPOEXE_PAR} refers to a Fortran executable for the
computational chemistry code DACAPO.

What's the difference between submitting a job on a node via mpirun
directly vs. via Torque? Shouldn't both be transparent to the Fortran
code? I am assuming I don't have to dig into the Fortran source. Any
debug tips?
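
One guess (unverified) is that the environment or resource limits
differ under PBS; e.g. a smaller stack limit inside a Torque job can
crash a Fortran binary that runs fine interactively. A quick way to
compare the two, as a sketch:

# In an interactive shell on the node:
env | sort > /tmp/env.interactive
ulimit -a > /tmp/limits.interactive

# Inside the Torque job script:
env | sort > /tmp/env.pbs
ulimit -a > /tmp/limits.pbs

# Then, on the same node:
diff /tmp/env.interactive /tmp/env.pbs
diff /tmp/limits.interactive /tmp/limits.pbs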

Thanks!

-- 
Rahul


