updating the Linux kernel

David Lombard david.lombard at mscsoftware.com
Mon Jun 12 08:11:08 PDT 2000


Crutcher Dunnavant wrote:
> 
> Now, I might completely miss something here, but shouldn't all *distributed*
> parallel programs assume that a node may not return? After all, what do you
> assume about hardware failures? ...

Um, no.

It all depends upon the software.  PVM does provide the ability to
recover from a node failure, while an MPI program will just tank.
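
To make the distinction concrete, here is a minimal sketch of the kind
of recovery logic a master process can implement when its message layer
reports a dead node instead of aborting the whole job.  This is plain
Python, not PVM or MPI code: the function name, the worker names, and
the "failing" set are all hypothetical, and node failure is only
simulated.

```python
from collections import deque


def run_with_reassignment(tasks, workers, failing):
    """Run every task, requeueing work assigned to a worker that dies.

    tasks   -- list of integer task inputs (hypothetical workload)
    workers -- list of worker names
    failing -- set of worker names that die on first use (simulated)
    """
    pending = deque(tasks)
    alive = list(workers)
    results = {}
    turn = 0
    while pending:
        if not alive:
            # Nothing left to reassign to: the job really is lost.
            raise RuntimeError("no workers left")
        task = pending.popleft()
        worker = alive[turn % len(alive)]
        if worker in failing:
            # Simulated node failure: drop the node, requeue its task.
            alive.remove(worker)
            pending.append(task)
            continue
        results[task] = task * task  # stand-in for the real computation
        turn += 1
    return results, alive


if __name__ == "__main__":
    results, alive = run_with_reassignment([1, 2, 3, 4],
                                           ["n1", "n2", "n3"], {"n2"})
    print(sorted(results.items()), alive)
```

A program that cannot be told about the failure has no point at which
to run the requeue step, which is why the job as a whole goes down.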

> ... So, while it may not be a *good* way to do it,
> in a properly parallelized application, shouldn't you be able to take down any
> random node other than the job allocation node, AT ANY TIME, and have that job
> reallocated? (Yeah, you lose the local work, but those tasks should be
> checkpointed frequently)...

As for checkpointing, that too is an "it depends" answer. 
Application-level checkpointing may be available to varying degrees --
it can be a non-trivial task.  System-level checkpointing generally
can't handle sockets, and that rules out both PVM and MPI.
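
As an illustration of what application-level checkpointing involves,
here is a minimal sketch in Python.  It assumes the whole state of the
computation fits in a small dictionary; the file layout, the function
names, and the "stop_at" parameter (used only to simulate a crash) are
hypothetical.  Real applications have far more state to capture, which
is where the non-trivial part comes in.

```python
import json
import os


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # readers never see a half-written file


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_i": 0, "total": 0}


def run(path, n, every=10, stop_at=None):
    """Sum 0..n-1, checkpointing every `every` steps.

    stop_at simulates a crash just before step stop_at is processed.
    """
    state = load_checkpoint(path)
    for i in range(state["next_i"], n):
        if stop_at is not None and i == stop_at:
            return None  # simulated failure; uncheckpointed work is lost
        state["total"] += i
        state["next_i"] = i + 1
        if state["next_i"] % every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling run() again with the same path after a simulated crash resumes
from the last saved step, so only the work done since that checkpoint
is repeated.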

-- 
David N. Lombard
MSC.Software




More information about the Beowulf mailing list