[Beowulf] Killing many user jobs on many compute nodes
Diego M. Vadell dvadell at linuxclusters.com.ar
Tue Sep 12 08:37:06 PDT 2006

Hi Dan,

If you use PBS/Torque or some other batch system that supports epilogue
scripts, the epilogue script at http://bellatrix.pcl.ox.ac.uk/~ben/pbs/ may
help you:

"When running parallel jobs on Linux clusters with MPICH and PBS, "slave"
MPICH processes are often left behind on one or more nodes at job abortion.
PBS makes no attempt to clean up processes on any node except the master
node, and so these processes can linger for some time.

The approach used at the WGR/PDL lab is to kill these processes by means of
a second MPI-enabled program, which is run on the same set of nodes that the
main job was run on, by the PBS epilogue facility. This program kills all of
the user's processes that have the relevant PBS job ID in their environment,
so should leave other jobs on the same machine untouched.

To set up this system, this C program should be compiled with mpicc and
installed as /usr/local/bin/mpicleanup on every MPI node. This epilogue
script should then be used by PBS on every node (usually it needs to be
installed as /usr/spool/PBS/mom_priv/epilogue) to call the MPICH cleanup
program properly at job termination."

I have only just started using it, so I cannot say whether it works in all
cases.

Hope it helps,
--
Diego.

On Friday 08 September 2006 12:56, Daniel.G.Roberts at sanofi-aventis.com wrote:
> Hello All
>
> Does anyone have a method/script that they would be willing to pass along
> that could be used in the process of terminating user processes out on the
> compute nodes that were never properly cleaned up? Thanks!
> Dan
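
The quoted description says the cleanup program kills the user's processes
that carry the relevant PBS job ID in their environment. The following is a
minimal Python sketch of that selection step only, not the mpicleanup C
program from the page above. It assumes a Linux /proc filesystem and that
the batch system exports the job ID as PBS_JOBID. The function names
(parse_environ, find_job_pids, kill_job_processes) and the proc_root
parameter are made up for illustration.

```python
import os
import signal
from pathlib import Path


def parse_environ(raw: bytes) -> dict:
    """Parse the NUL-separated KEY=VALUE blob from /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if sep:  # skip empty or malformed entries
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def find_job_pids(job_id, uid, proc_root="/proc"):
    """Return sorted PIDs owned by uid whose PBS_JOBID equals job_id.

    Assumption: PBS_JOBID is the variable the batch system sets for every
    process of the job. proc_root is a hypothetical hook for testing.
    """
    pids = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # not a process directory
        pid = int(entry.name)
        if pid == os.getpid():
            continue  # never select the cleanup process itself
        try:
            if entry.stat().st_uid != uid:
                continue  # belongs to another user
            env = parse_environ((entry / "environ").read_bytes())
        except OSError:
            continue  # process already exited, or environ is unreadable
        if env.get("PBS_JOBID") == job_id:
            pids.append(pid)
    return sorted(pids)


def kill_job_processes(job_id, uid, sig=signal.SIGKILL):
    """Send sig to every process selected by find_job_pids."""
    for pid in find_job_pids(job_id, uid):
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass  # the process exited between the scan and the kill
```

Matching on both the owner and the job ID is what should leave the same
user's other jobs on the node untouched. A script like this would still have
to be started on every node the job used, which is the part the MPI-enabled
program above takes care of.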
