[Beowulf] Kill zombies after a parallel run

Chris Samuel csamuel at vpac.org
Tue May 2 17:28:20 PDT 2006


On Tuesday 02 May 2006 17:49, mg wrote:

> I use MPICH-1.2.5.2 to generate and run an FEM parallel application.
>
> During a parallel run, one process can crash, leaving the other
> processes running, and OS commands have to be used to kill these zombies.
> So, does someone have a solution to avoid zombies after a failed
> parallel run: can the crashed process kill the other processes?

Wild guess time - is this being launched with PBS/Torque, and is your mpirun
using SSH to launch the jobs?

If that's the case it's not unusual (to quote Tom Jones), and we've seen the 
same here at VPAC.  What we do is encourage all users to use Pete Wyckoff's 
excellent "mpiexec" program (now at version 0.81) at:

	http://www.osc.edu/~pw/mpiexec/index.php

This talks directly to PBS using the TM (task management) interface. It
retrieves the list of allocated nodes itself (so it does not need to be told
how many processes to start or where) and uses TM to get the moms (the
pbs_mom daemons on each node) to launch the processes directly, so the moms
have direct oversight of them.

When one process dies the moms notice and mpiexec gets told, so it can reap 
the rest of them.
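
To illustrate the idea only (this is not how mpiexec or the TM interface is
implemented, just a rough Python sketch under my own assumptions): a launcher
that starts its children directly can watch them and, as soon as one exits
with a non-zero status, terminate the others. The name run_and_reap and the
polling interval are made up for the example.

```python
import subprocess
import time


def run_and_reap(commands, poll_interval=0.05):
    """Start every command; if any exits non-zero, terminate the rest.

    `commands` is a list of argv lists. Returns the exit codes in the
    same order. Hypothetical helper, for illustration only.
    """
    # The launcher is the direct parent of every process, so it can
    # observe each exit status itself.
    procs = [subprocess.Popen(cmd) for cmd in commands]

    while True:
        codes = [p.poll() for p in procs]  # None means still running
        if all(c is not None for c in codes):
            break  # everything finished on its own
        if any(c not in (None, 0) for c in codes):
            # One process failed: ask the survivors to stop.
            for p in procs:
                if p.poll() is None:
                    p.terminate()
            break
        time.sleep(poll_interval)

    # Collect every exit status so no child is left unreaped.
    return [p.wait() for p in procs]
```

Calling run_and_reap with one long-running command and one that exits with an
error should return quickly, with the long-running one terminated rather than
left behind. With mpirun over SSH the launcher typically has no such direct
view of the remote processes, which would explain the leftovers.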

Best of luck!
Chris
-- 
 Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia



