Two heads are better than one! :)


Donald Becker becker at scyld.com
Thu Oct 31 21:01:47 PST 2002


On 31 Oct 2002, Joseph Landman wrote:

> On Thu, 2002-10-31 at 20:56, Bob Drzyzgula wrote:
...
> > master. Clearly if multiple simultaneously operating
> > masters are tolerated in the API, you can just have
> > multiple head nodes which are available all the time. If
> > an API requires a single master, one might have to effect
> > some sort of manual switch-over in the event of a head
> > node failure; this would then raise the question of the
> > complexity of such a switch-over, e.g. would compute node
> > reconfiguration be required or would it simply be a matter
> > of starting up the controller service on a new system.

We were showing a commercial version of such a system in the HP booth
at LinuxWorld -- a pair of Scyld masters with Steeleye Lifekeeper
handling the fail-over of services if a master fails or a shutdown rule
is triggered.

>   It is more complex than that, in that you would need to preserve state
> changes over the length of the program, and PVM/MPI/et al do not
> preserve this state information.

One rule of thumb: people who think that application-independent
checkpointing is possible haven't actually considered the implementation
and implications.  In real life the most practical way to handle the
issue is
  - having the system handle checkpoint signal support
  - making it easy to write, gather and restore checkpoint files, and
  - providing examples of application-supported checkpointing
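
To make the three points above concrete, here is a minimal sketch of
application-supported checkpointing in Python.  It is an illustration
under stated assumptions, not how any particular system does it: the
choice of SIGUSR1 as the checkpoint signal, the JSON file format, and
all function names are hypothetical.  The handler only sets a flag; the
application writes its own state at a point where that state is
consistent, and reloads it on startup.

```python
import json
import os
import signal
import tempfile

# Assumption: the system requests a checkpoint by sending SIGUSR1 (POSIX only).
CHECKPOINT_SIGNAL = signal.SIGUSR1

checkpoint_requested = False


def request_checkpoint(signum, frame):
    """Signal handler: only set a flag; the write happens at a safe point."""
    global checkpoint_requested
    checkpoint_requested = True


def write_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint.

    The rename is atomic, so a crash mid-write never leaves a torn file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def restore_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def run(path, n_steps):
    """Toy computation (sum of 0..n_steps-1) that resumes from a checkpoint."""
    global checkpoint_requested
    signal.signal(CHECKPOINT_SIGNAL, request_checkpoint)
    state = restore_checkpoint(path)
    while state["step"] < n_steps:
        state["total"] += state["step"]
        state["step"] += 1
        # Safe point: the state is consistent between iterations.
        if checkpoint_requested:
            write_checkpoint(path, state)
            checkpoint_requested = False
    return state["total"]
```

Only the application knows that "step" and "total" are all it needs to
resume, which is why the checkpoint can stay this small.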

> The folks at LANL had a fault tolerant MPI at one point, but I haven't
> heard much of it recently.

I would like to see a paper on the real-life result.  I'm guessing that
the overhead overwhelms any possible saving even with frequent node
failure.  That's exactly the sort of result that makes for a useful
paper -- "You must have a much better idea than this, or it won't work."

-- 
Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Scyld Beowulf cluster system
Annapolis MD 21403			410-990-9993



