[Beowulf] the solution for qdel fail.....

William Scullin wscullin at cct.lsu.edu
Thu Jan 6 15:56:37 PST 2005


Howdy,

	The --gm-kill option is specific to clusters using Myrinet. It is
there mostly to ensure that slave processes using Myrinet's MPI exit
when the master process is done running. The number after --gm-kill is
the timeout in seconds.

	I am not sure which version or member of the PBS family you are using.
If you are using PBS Pro (this is probably also true for TORQUE and
OpenPBS), you should be able to place two scripts called prologue and
epilogue in /var/spool/PBS/mom_priv/ on every compute node. They must be
owned by root and be readable, writable, and executable only by root.
The prologue script runs before every job and the epilogue script runs
after every job. In the prologue and epilogue scripts we use, we clean
the nodes of all lingering user processes and do some basic checking of
node health.

	Even if an epilogue script misses a process, or a user launches a
process outside of the queuing system, the prologue will still catch
it before the next job starts to run.
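
	As a rough illustration of the cleanup step, here is a minimal sketch
of an epilogue, not the script we actually run. It assumes pbs_mom passes
the job owner's user name as the second argument, that nodes are not
shared between jobs, that ps accepts "-o uid=" (Linux procps does), and
that accounts with a UID below 500 are system accounts. MIN_UID,
pids_owned_by, and cleanup_user are names made up for this example.

```shell
#!/bin/sh
# Hypothetical epilogue sketch: kill whatever the job owner left behind.
# Assumption: pbs_mom calls this script with the job id as $1 and the
# job owner's user name as $2.

MIN_UID=500   # assumption: UIDs below this belong to system accounts

# Read "pid uid" lines on stdin and print the PIDs whose uid equals $1.
pids_owned_by() {
    awk -v u="$1" '$2 == u { print $1 }'
}

# Send TERM, wait briefly, then send KILL to every process of user $1.
cleanup_user() {
    user=$1
    uid=$(id -u "$user" 2>/dev/null) || return 0   # unknown user: skip
    [ "$uid" -ge "$MIN_UID" ] || return 0           # never touch system accounts
    pids=$(ps -e -o pid= -o uid= | pids_owned_by "$uid")
    [ -n "$pids" ] || return 0                      # nothing left running
    kill -TERM $pids 2>/dev/null                    # unquoted on purpose: one PID per word
    sleep 5
    kill -KILL $pids 2>/dev/null
    return 0
}

# Only act when a user name was actually supplied.
if [ -n "${2:-}" ]; then
    cleanup_user "$2"
fi
```

	Installed as the epilogue file, it would need "chown root" and
"chmod 700" to meet the ownership rule above. If your nodes can run more
than one job at a time, killing by user like this will also take out the
user's other jobs, so it would need to be narrowed first.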

	Best,
	William
 
On Thu, 2005-01-06 at 14:33, Jerry Xu wrote:
> Hey, Huang:
> 
>   I found one solution that works for me, maybe you can try it and see
> whether it works for you.
> 
> in your PBS script, try adding the "--gm-kill 5" option between the
> processor count and your program
> 
> like this 
> 
> mpirun -machinefile $PBS_NODEFILE -np $NPROCS --gm-kill 5 myprogram
> 
> it works for me.
> 
> Jerry.
> 
> /**********************************************************
> Hi,
> 
> We have a new system set up. The vendor set up the PBS for us. For
> administration reasons, we created a new queue "dque" (set to default)
> using the "qmgr" command:
> 
> create queue dque queue_type=e
> s q dqueue enabled=true, started=true
> 
> I was able to submit jobs using the "qsub" command to queue "dque".
> However, when I use "qdel" to kill a job, the job disappears from the
> job list shown by "qstat -a", but the executable is still running on
> the compute nodes. Every time, I have to log in to the corresponding
> compute node and kill the running job.
> 
> I am wondering if I missed something in setting up the queue so that I
> am unable to kill the job completely using "qdel".
> 
> Thanks.
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
William Scullin
System Administrator
Center for Computation and Technology
342 Johnston Hall
Louisiana State University
Baton Rouge, Louisiana 70803
voice:	225 578 6888
fax:	225 578 5362
aim:	WilliamAtLSU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



