Fault tolerance and MPI

Tony Skjellum tony at MPI-Softtech.Com
Mon Feb 5 06:49:12 PST 2001


You can see our initial paper on this subject at

http://www.mpi-softtech.com/publications/mpift-paper-dsm2001.pdf

It contains references to other known works in this area.

-Tony

Anthony Skjellum, PhD, President (tony at mpi-softtech.com) 
MPI Software Technology, Inc., Ste. 33, 101 S. Lafayette, Starkville, MS 39759
+1-(662)320-4300 x15; FAX: +1-(662)320-4301; http://www.mpi-softtech.com
"Best-of-breed Software for Beowulf and Easy-to-Own Commercial Clusters."

On Mon, 5 Feb 2001 Carl_Notfors at vdgc.com.sg wrote:

> 
> 
> Our computational model is quite simple.  We have a master node and a
> number of slave nodes.  All communication is between the master and the
> slaves, i.e. there is no slave-to-slave communication, so all
> communication is done with MPI_Send and MPI_Recv (we are using LAM/MPI).
> 
> The problem with MPI is that there is no fault tolerance: if a slave node
> "dies", the whole process goes down.  According to the LAM documentation it
> should be possible to achieve some fault tolerance, but we have not yet
> tried this.
> 
> Has anyone got this working?  Is there fault tolerance in any
> other MPI implementations?  Would it be better to use PVM if you want fault
> tolerance?
> 
> 
> Carl
> 
> 
> 
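
The master/slave model in the quoted message can be illustrated with a small sketch of the bookkeeping a fault-tolerant master would need. This is not MPI code and not taken from the paper linked above. `send_and_receive` is a hypothetical stand-in for an MPI_Send/MPI_Recv pair that reports a dead slave by raising `WorkerDied` instead of bringing the whole job down. Whether a given MPI implementation can report failures that way is exactly the open question in this thread. The names `run_master` and `make_flaky_transport` are also made up for illustration, and the loop is sequential for simplicity.

```python
from collections import deque


class WorkerDied(Exception):
    """Raised by the hypothetical transport when a worker does not answer."""


def run_master(tasks, workers, send_and_receive):
    """Hand tasks to workers one at a time, reassigning on failure.

    send_and_receive(worker, task) is an assumed callback: it returns the
    worker's result, or raises WorkerDied if the worker is gone.
    """
    pending = deque(tasks)
    alive = deque(workers)
    results = {}
    while pending:
        if not alive:
            raise RuntimeError("all workers have failed")
        task = pending.popleft()
        worker = alive.popleft()
        try:
            results[task] = send_and_receive(worker, task)
        except WorkerDied:
            # Forget the dead worker and put its task back at the front.
            pending.appendleft(task)
            continue
        alive.append(worker)  # healthy worker goes back into rotation
    return results


def make_flaky_transport(dead_workers):
    """Build a fake transport in which the given workers never answer."""
    def send_and_receive(worker, task):
        if worker in dead_workers:
            raise WorkerDied(worker)
        return task * task  # placeholder for the real computation
    return send_and_receive


if __name__ == "__main__":
    transport = make_flaky_transport(dead_workers={2})
    print(run_master(range(6), workers=[1, 2, 3], send_and_receive=transport))
```

The point of the sketch is only that, because the slaves never talk to each other, the master is the single place that has to notice a failure and reissue the lost task. The hard part, getting the communication layer to report the failure rather than abort, is not shown.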




