[Beowulf] job runs with mpirun on a node but not if submitted via Torque.
Rahul Nabar
rpnabar at gmail.com
Tue Mar 31 16:58:45 PDT 2009
On Tue, Mar 31, 2009 at 6:43 PM, Don Holmgren <djholm at fnal.gov> wrote:
>
> How are your individual MPI processes crashing when run under Torque? Are
> there any error messages?
Thanks Don! There aren't any useful error messages.
My job hierarchy is actually like so:
{shell script submitted to Torque} --> calls Python --> loop until
convergence {calls a Fortran executable}
The Fortran executable is the one that has the MPI calls to parallelize
over processors.
The crash is *not* so bad that Torque kills the job. What happens is
that the Fortran executable crashes and Python keeps looping it over and
over again. The crash only happens when I submit via Torque.
If I do this instead
mpirun from the node --> shell wrapper --> calls Python --> loop until
convergence {calls a Fortran executable}
Then everything works fine. Note that the Python and shell layers are not
truly parallelized. The Fortran is the only place where actual
parallelization happens.
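Written out a bit more concretely, the two paths are roughly as follows
(just a sketch; the script names are made up and I'm guessing at where
exactly mpirun sits in the Torque case):

  ### fails: submitted with "qsub submit.sh"
  # submit.sh  (hypothetical PBS script)
  #PBS -l nodes=1:ppn=8
  cd $PBS_O_WORKDIR
  mpirun ./wrapper.sh    # wrapper.sh -> Python -> loop { Fortran exec }

  ### works: the same chain started by hand on the node
  ssh node01
  cd /path/to/workdir
  mpirun ./wrapper.sh    # identical command, just from a login shell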
> The environment for a Torque job on a worker node under openMPI is inherited
> from the pbs_mom process. Sometimes differences between this environment
> and
> the standard login environment can cause troubles.
Exactly. Can I somehow obtain a dump of this environment to compare
the direct mpirun run vs. the Torque run? What would be the best way? Just a
dump from "set"? Any crucial variables to look for? Maybe a ulimit?
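Something like this is what I had in mind, unless there's a better way
(script names made up):

  # dump_env.sh -- throwaway job just to capture the pbs_mom environment
  #PBS -l nodes=1
  env | sort  > $PBS_O_WORKDIR/env.torque
  ulimit -a   > $PBS_O_WORKDIR/ulimit.torque

  # and the same from a normal ssh login on the same node:
  env | sort  > env.login
  ulimit -a   > ulimit.login

  diff env.torque    env.login      # PATH / LD_LIBRARY_PATH would be my
  diff ulimit.torque ulimit.login   # first guesses for MPI trouble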
>
> Instead of logging into the node directly, you might want to try an
> interactive
> job (use "qsub -I") and then try your mpirun.
I'm trying that now.
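i.e. roughly this, if I understand it right (the resource request is
just an example):

  qsub -I -l nodes=1:ppn=8
  # ...wait for the prompt on the allocated node, then:
  cd $PBS_O_WORKDIR
  mpirun ./wrapper.sh    # the same (made-up) wrapper that works from
                         # a plain ssh login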
--
Rahul