Two heads are better than one! :)
Donald Becker
becker at scyld.com
Thu Oct 31 21:01:47 PST 2002
On 31 Oct 2002, Joseph Landman wrote:
> On Thu, 2002-10-31 at 20:56, Bob Drzyzgula wrote:
...
> > master. Clearly if multiple simultaneously operating
> > masters are tolerated in the API, you can just have
> > multiple head nodes which are available all the time. If
> > an API requires a single master, one might have to effect
> > some sort of manual switch-over in the event of a head
> > node failure; this would then raise the question of the
> > complexity of such a switch-over, e.g. would compute node
> > reconfiguration be required or would it simply be a matter
> > of starting up the controller service on a new system.
We were showing a commercial version of such a system in the HP booth
at LinuxWorld -- a pair of Scyld masters with Steeleye Lifekeeper
handling the fail-over of services if a master fails or a shutdown rule
is triggered.
> It is more complex than that, in that you would need to preserve state
> changes over the length of the program, and PVM/MPI/et al do not
> preserve this state information.
One rule of thumb: people who think application-independent checkpointing
is possible haven't actually considered the implementation and
implications. In real life the most practical way to handle the issue is
- having the system handle checkpoint signal support
- making it easy to write, gather and restore checkpoint files, and
- providing examples of application-supported checkpointing
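
As a rough illustration of the last item, here is a minimal sketch of
application-supported checkpointing in Python. Everything in it is an
assumption made for the example: the choice of SIGUSR1 as the checkpoint
signal, the JSON file format, and the function names (request_checkpoint,
write_checkpoint, restore_checkpoint, run) are hypothetical and do not
describe any Scyld or MPI/PVM interface. The point is only that the
application decides what its state is and when it is safe to save it.

```python
import json
import os
import signal
import tempfile

# Set by the signal handler, polled by the main loop (hypothetical name).
CHECKPOINT_REQUESTED = False


def request_checkpoint(signum, frame):
    """Signal handler: only record the request; the save happens at a safe point."""
    global CHECKPOINT_REQUESTED
    CHECKPOINT_REQUESTED = True


def install_handler():
    """Assumption: SIGUSR1 is the checkpoint signal (POSIX systems only)."""
    signal.signal(signal.SIGUSR1, request_checkpoint)


def write_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def restore_checkpoint(path):
    """Return the saved state, or a fresh initial state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, steps):
    """Toy computation: sum the integers below `steps`, resuming from a checkpoint."""
    global CHECKPOINT_REQUESTED
    state = restore_checkpoint(path)
    while state["step"] < steps:
        state["total"] += state["step"]
        state["step"] += 1
        # Safe point: the state is consistent here, so honor a pending request.
        if CHECKPOINT_REQUESTED:
            write_checkpoint(path, state)
            CHECKPOINT_REQUESTED = False
    return state
```

A real program would call install_handler() once at startup and then
run(); the system (or an operator, with "kill -USR1 <pid>") requests a
checkpoint, and a restarted job picks up from the last file written.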
> The folks at LANL had a fault tolerant MPI at one point, but I haven't
> heard much of it recently.
I would like to see a paper on the real-life result. I'm guessing that
the overhead overwhelms any possible saving even with frequent node
failure. That's exactly the sort of result that makes for a useful
paper -- "You must have a much better idea than this, or it won't work."
--
Donald Becker                    becker at scyld.com
Scyld Computing Corporation      http://www.scyld.com
410 Severn Ave. Suite 210        Scyld Beowulf cluster system
Annapolis MD 21403               410-990-9993