[Beowulf] the solution for qdel fail.....

Jerry Xu jerry at oban.biosc.lsu.edu
Mon Jan 10 07:49:12 PST 2005


Hi William, thanks for your information. In case somebody still
needs it for OpenPBS configuration, here is my epilogue file. It should be
located in $pbshome/mom_priv/ on each node, and it needs to be
executable and owned by root. Others may have better epilogue
scripts...


/*****************************************************/
#!/bin/sh
# PBS mom runs the epilogue with the job id as $1 and the job owner as $2
echo '------------clean up------------'
echo "running pbs epilogue script"

# set key variables
USER=$2
NODEFILE=/var/spool/pbs/aux/$1

echo
echo "killing processes of user $USER on the batch nodes"
for node in `cat $NODEFILE`
do
       echo "Doing node $node"
       su $USER -c "ssh $node skill -KILL -u $USER"
done
echo Done

/****************************************************/
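For the prologue side that William describes below, here is a minimal sketch along the same lines. This is hypothetical, not a tested script: it assumes the same /var/spool/pbs layout and skill(1) on the nodes, that mom passes the job id as $1 and the job owner as $2 (as with the epilogue), and that a user's jobs do not share nodes concurrently (otherwise the cleanup would kill a live job).

```shell
#!/bin/sh
# PBS prologue sketch: clear leftover user processes before the job starts.
# Place in $pbshome/mom_priv/prologue, owned by root, mode 0700.

USER=$2
NODEFILE=/var/spool/pbs/aux/$1

echo "running pbs prologue script"
for node in `cat $NODEFILE`
do
        # kill anything the user left behind from an earlier job
        su $USER -c "ssh $node skill -KILL -u $USER"
done
exit 0    # a non-zero exit from a prologue aborts the job
```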




On Thu, 2005-01-06 at 17:56, William Scullin wrote:
> Howdy,
> 
> 	The --gm-kill is specific to clusters using myrinet and mostly is there
> to ensure that slave processes using myrinet's mpi hang up when the
> master process is done running. The number after the --gm-kill is the
> timeout in seconds.
> 
> 	I am not sure which version, type, or member of the PBS family you are
> using. If you are using PBS Pro (also probably true for torque and Open
> PBS), you should be able to place two scripts in
> /var/spool/PBS/mom_priv/ called prologue and epilogue on every compute
> node. They must be owned by root and be executable / readable / writable
> only by root. The prologue script will run before every job and the
> epilogue script will run after every job. In the epilogue and prologue
> scripts we use, we clean the nodes of all lingering user processes and
> do some basic checking of node health.
> 
> 	Even if an epilogue script misses a process, or a user launches
> a process outside of the queuing system, the prologue will still catch
> it before the next job starts to run.
> 
> 	Best,
> 	William
>  
> On Thu, 2005-01-06 at 14:33, Jerry Xu wrote:
> > Hey, Huang:
> > 
> >   I found one solution that works for me, maybe you can try it and see
> > whether it works for you.
> > 
> > in your pbs script, try adding the "--gm-kill 5" option between the
> > processor count and your program
> > 
> > like this 
> > 
> > mpirun -machinefile $PBS_NODEFILE -np $NPROCS --gm-kill 5 myprogram
> > 
> > it works for me.
> > 
> > Jerry.
> > 
> > /**********************************************************
> > Hi,
> > 
> > We have a new system set up. The vendor set up the PBS for us. For
> > administration reasons, we created a new queue "dque" (set to default)
> > using the "qmgr" command:
> > 
> > create queue dque queue_type=e
> > set queue dque enabled=true, started=true
> > 
> > I was able to submit jobs using the "qsub" command to queue "dque".
> > However, when I use "qdel" to kill a job, the job disappears from the
> > job list shown by "qstat -a", but the executable is still running on
> > the compute nodes. Every time, I have to log in to the corresponding
> > compute node and kill the running job by hand.
> > 
> > I am wondering if I missed something in setting up the queue so that I
> > am unable to kill the job completely using "qdel".
> > 
> > Thanks.
> > 
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org
> > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> 
> 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> William Scullin
> System Administrator
> Center for Computation and Technology
> 342 Johnston Hall
> Louisiana State University
> Baton Rouge, Louisiana 70803
> voice:	225 578 6888
> fax:	225 578 5362
> aim:	WilliamAtLSU
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



