[Beowulf] Kill zombies after a parallel run
David Kewley
kewley at gps.caltech.edu
Tue May 2 13:02:46 PDT 2006
I don't have a solution for your case, but here's an idea: MPICH-GM (MPICH
for the Myrinet GM protocol) has an option to mpirun.ch_gm that would do
what you want, if you were running Myrinet/GM:
    --gm-kill <n>    Kill all processes <n> seconds after the first exits.
Other than that, a resource manager may do what you want -- our resource
manager, LSF, does this for us. It even mostly works. :)
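
If neither Myrinet/GM nor a resource manager is available, the same "kill the rest some seconds after the first exit" policy could be approximated by a small launcher script. What follows is only a rough sketch of that idea, not how mpirun.ch_gm implements --gm-kill. The function name run_and_reap and its parameters are made up for illustration, and it only supervises processes it started itself on the local node; ranks started on remote nodes would need a per-node copy or a remote kill.

```python
import subprocess
import sys
import time


def run_and_reap(commands, grace_seconds=5.0, poll_interval=0.1):
    """Start every command, then kill the survivors grace_seconds after the first exit.

    Returns the exit codes in the same order as commands.
    (Hypothetical helper; the name and parameters are illustrative assumptions.)
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    try:
        # Wait until at least one process has exited, cleanly or not.
        while all(p.poll() is None for p in procs):
            time.sleep(poll_interval)

        # Give the remaining processes a grace period to finish on their own.
        deadline = time.monotonic() + grace_seconds
        while time.monotonic() < deadline and any(p.poll() is None for p in procs):
            time.sleep(poll_interval)
    finally:
        # Anything still running is treated as a leftover and killed.
        for p in procs:
            if p.poll() is None:
                p.kill()
        # Reap every child so none is left behind as a zombie.
        for p in procs:
            p.wait()
    return [p.returncode for p in procs]


if __name__ == "__main__":
    # Stand-ins for two ranks: one exits immediately with an error,
    # the other would otherwise keep running for a minute.
    crasher = [sys.executable, "-c", "import sys; sys.exit(3)"]
    sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(run_and_reap([crasher, sleeper], grace_seconds=1.0))
```

The grace period plays the role of <n> above: a job whose ranks finish at slightly different times is left alone, while a rank stuck waiting on a crashed peer is killed once the period expires.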
David
On Tuesday 02 May 2006 00:49, mg wrote:
> Hi all,
>
> I use MPICH-1.2.5.2 to generate and run an FEM parallel application.
>
> During a parallel run, one process can crash, leaving the other
> processes running, and OS commands have to be used to kill these zombies.
> So, does anyone have a solution to avoid zombies after a failed
> parallel run: can the crashed process kill the other processes?
>
> Thanks,
> Mathieu