Two heads are better than one! :)

Donald Becker becker at scyld.com
Thu Oct 31 21:01:47 PST 2002


On 31 Oct 2002, Joseph Landman wrote:

> On Thu, 2002-10-31 at 20:56, Bob Drzyzgula wrote:
...
> > master. Clearly if multiple simultaneously operating
> > masters are tolerated in the API, you can just have
> > multiple head nodes which are available all the time. If
> > an API requires a single master, one might have to effect
> > some sort of manual switch-over in the event of a head
> > node failure; this would then raise the question of the
> > complexity of such a switch-over, e.g. would compute node
> > reconfiguration be required or would it simply be a matter
> > of starting up the controller service on a new system.

We were showing a commercial version of such a system in the HP booth
at LinuxWorld -- a pair of Scyld masters with Steeleye Lifekeeper
handling the fail-over of services if a master fails or a shutdown rule
is triggered.

>   It is more complex than that, in that you would need to preserve state
> changes over the length of the program, and PVM/MPI/et al do not
> preserve this state information.

One rule of thumb: people who think that application-independent
checkpointing is possible haven't actually considered the implementation
and implications.  In real life the most practical way to handle the issue is
  - having the system handle checkpoint signal support
  - making it easy to write, gather and restore checkpoint files, and
  - providing examples of application-supported checkpointing
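
To make the three points above concrete, here is a minimal sketch of
application-supported checkpointing in Python.  Everything in it is an
assumption made for illustration, not anything Scyld ships: the use of
SIGUSR1 as the checkpoint signal, the JSON file format, the toy
computation, and the names request_checkpoint, save_checkpoint,
load_checkpoint, install_handler and run are all hypothetical.

```python
import json
import os
import signal
import tempfile

# Set by the signal handler, polled by the work loop (hypothetical flag).
checkpoint_requested = False


def request_checkpoint(signum, frame):
    """Signal handler: only record the request; the loop saves at a safe point."""
    global checkpoint_requested
    checkpoint_requested = True


def install_handler():
    """Register the handler. SIGUSR1 is an arbitrary choice (POSIX only)."""
    signal.signal(signal.SIGUSR1, request_checkpoint)


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps):
    """Toy computation: sum the integers 0..n_steps-1, resumable from a checkpoint."""
    global checkpoint_requested
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        state["total"] += state["step"]
        state["step"] += 1
        # Only the application knows when its state is consistent,
        # so the checkpoint is written here rather than in the handler.
        if checkpoint_requested:
            save_checkpoint(path, state)
            checkpoint_requested = False
    return state["total"]
```

The application would call install_handler() once at start-up and then
run(); an operator or batch system could request a checkpoint with
"kill -USR1 <pid>", and a restarted job picks up from the saved step.
The point of the sketch is that the application decides what its state
is and when it is safe to save it; the system only delivers the signal
and stores the file.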

> The folks at LANL had a fault tolerant MPI at one point, but I haven't
> heard much of it recently.

I would like to see a paper on the real-life result.  I'm guessing that
the overhead overwhelms any possible saving even with frequent node
failure.  That's exactly the sort of result that makes for a useful
paper -- "You must have a much better idea than this, or it won't work."

-- 
Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Scyld Beowulf cluster system
Annapolis MD 21403			410-990-9993



