[Beowulf] the solution for qdel fail.....
Jerry Xu
jerry at oban.biosc.lsu.edu
Mon Jan 10 07:49:12 PST 2005
Hi, William, Thanks for your information. Just in case somebody still
needs it for OpenPBS configuration, here is my epilogue file. It should be
located in $pbshome/mom_priv/ on each node, and it needs to be
executable and owned by root. Others may have better epilogue
scripts...
/*****************************************************/
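#!/bin/sh
# PBS epilogue: kill any processes the job owner left on the batch nodes.
# PBS passes the job ID as $1 and the job owner's user name as $2.
echo '------------clean up------------'
echo running pbs epilogue script
# set key variables
USER=$2
NODEFILE=/var/spool/pbs/aux/$1
echo
echo "killing processes of user $USER on the batch nodes"
# the node file may list a node more than once, so visit each node only once
for node in $(sort -u "$NODEFILE")
do
    echo "Doing node $node"
    su "$USER" -c "ssh $node skill -KILL -u $USER"
done
echo Done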
/****************************************************/
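Installing the file with the ownership and permissions described above could be sketched roughly as follows. This is only an illustration: the function name install_epilogue and its two arguments are made up for this example, and the chown step is skipped when the script is not run as root.

```shell
#!/bin/sh
# Sketch: copy an epilogue script into a mom_priv directory and restrict it.
# install_epilogue is a hypothetical helper name, not part of PBS.
install_epilogue() {
    src=$1        # path to the epilogue script to install
    mom_priv=$2   # the node's mom_priv directory, e.g. $pbshome/mom_priv
    cp "$src" "$mom_priv/epilogue" || return 1
    # executable, readable and writable by the owner only
    chmod 700 "$mom_priv/epilogue" || return 1
    # only root can hand the file over to root
    if [ "$(id -u)" -eq 0 ]; then
        chown root "$mom_priv/epilogue" || return 1
    fi
}
```

On a node this might be called as "install_epilogue ./epilogue /var/spool/pbs/mom_priv", assuming the spool directory is the same one used for NODEFILE in the script above; adjust the path for your installation.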
On Thu, 2005-01-06 at 17:56, William Scullin wrote:
> Howdy,
>
> The --gm-kill is specific to clusters using myrinet and mostly is there
> to ensure that slave processes using myrinet's mpi hang up when the
> master process is done running. The number after the --gm-kill is the
> timeout in seconds.
>
> I am not sure which version, type, or member of the PBS family you are
> using. If you are using PBS Pro (also probably true for torque and Open
> PBS), you should be able to place two scripts in
> /var/spool/PBS/mom_priv/ called prologue and epilogue on every compute
> node. They must be owned by root and be executable / readable / writable
> only by root. The prologue script will run before every job and the
> epilogue script will run after every job. In the epilogue and prologue
> scripts we use, we clean the nodes of all lingering user processes and
> do some basic checking of node health.
>
> Even if an epilogue script misses a process – or a user launches
> a process outside of the queuing system – the prologue will still catch
> it before the next job starts to run.
>
> Best,
> William
>
> On Thu, 2005-01-06 at 14:33, Jerry Xu wrote:
> > Hey, Huang:
> >
> > I found one solution that works for me, maybe you can try it and see
> > whether it works for you.
> >
> > In your PBS script, try adding the "--gm-kill 5" option between the
> > processor number and your program
> >
> > like this
> >
> > mpirun -machinefile $PBS_NODEFILE -np $NPROCS --gm-kill 5 myprogram
> >
> > it works for me.
> >
> > Jerry.
> >
> > /**********************************************************
> > Hi,
> >
> > We have a new system set up. The vendor set up the PBS for us. For
> > administration reasons, we created a new queue "dque" (set to default)
> > using the "qmgr" command:
> >
> > create queue dque queue_type=e
> > s q dque enabled=true, started=true
> >
> > I was able to submit jobs using the "qsub" command to queue "dque".
> > However, when I use "qdel" to kill a job, the job disappears from the
> > job list shown by "qstat -a", but the executable is still running on
> > the compute nodes. Every time, I have to log in to the corresponding
> > compute node and kill the running job.
> >
> > I am wondering if I missed something in setting up the queue so that I
> > am unable to kill the job completely using "qdel".
> >
> > Thanks.
> >
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org
> > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> William Scullin
> System Administrator
> Center for Computation and Technology
> 342 Johnston Hall
> Louisiana State University
> Baton Rouge, Louisiana 70803
> voice: 225 578 6888
> fax: 225 578 5362
> aim: WilliamAtLSU
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~