Two heads are better than one! :)

Thu Oct 31 19:27:30 PST 2002

Joe,

Thanks, that's a good point. How much of an issue this
is depends, I suppose, on the length of time it
takes any single "run" of an application to complete. In
the past, most of our focus from a systems redundancy
perspective has been to minimize the amount of time it
would take to get the users back up and running in the
event of a system failure; if it takes us more than an
hour or so we start feeling bad about it.  Checkpointing
or other restart capability has been more of a user
responsibility -- they know that if they can't find a way to
break the job up, and it winds up taking weeks for a single
thread to complete, they are running a risk we can't much
help them with. Luckily for us (the system builders and
managers), it is mostly the research (read: low priority)
stuff that tends to run for weeks.
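
To make "user-level checkpointing" concrete, here is a rough
sketch in Python of the sort of thing I mean. It is only an
illustration, not anything we actually run: the JSON state
layout, the checkpoint interval, and the names
save_checkpoint, load_checkpoint and run are all made up for
the example, and the arithmetic in the loop is a stand-in for
the real model.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the data is on disk before renaming
    os.replace(tmp, path)     # readers see either the old or the new file


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0, "total": 0.0}


def run(path, n_steps, every=100):
    """Resume from the last checkpoint and run up to n_steps (hypothetical job)."""
    state = load_checkpoint(path)
    for step in range(state["step"], n_steps):
        state["total"] += step * 0.5  # stand-in for one unit of real work
        state["step"] = step + 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If the machine dies partway through, rerunning the same
command picks up from the last saved step, so at most
"every" steps of work are lost. That bound is the knob the
user gets to trade off against checkpoint overhead.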

These are econometric models and simulations, FWIW.
Generally, production applications are designed to be
completable by deadline with a reasonable margin of safety,
while the analysts don't feel so constrained when selecting
problems for research projects, which often have no firm
deadlines.

Still, with the application parallelized and scattered to
a number of machines like this, there are new dimensions
to this problem. As Don mentioned in another post in this
thread, one issue is what you do with the processes that
continue to run on the cluster compute nodes if the master
fails -- what the recovery/cleanup procedure is and how
much you can salvage.

Beyond designing the system to be as fault-resilient
as possible, of course, this all becomes an exercise
in expectation management -- making sure that the user
community knows where the risks are and where they
aren't, so that they can plan accordingly. Your point is
particularly valuable in that regard. :-)

--Bob

On Thu, Oct 31, 2002 at 09:26:55PM -0500, Joseph Landman wrote:
> 
> On Thu, 2002-10-31 at 20:56, Bob Drzyzgula wrote:
> 
> > Thus, the question becomes whether any of the various
> > cluster APIs and services such as PVM, MPI, BPROC, PBS,
> > etc. are dependent on the selection of a single, exclusive
> > master. Clearly if multiple simultaneously operating
> > masters are tolerated in the API, you can just have
> > multiple head nodes which are available all the time. If
> > an API requires a single master, one might have to effect
> > some sort of manual switch-over in the event of a head
> > node failure; this would then raise the question of the
> > complexity of such a switch-over, e.g. would compute node
> > reconfiguration be required or would it simply be a matter
> > of starting up the controller service on a new system.
> 
> Hi Bob:
> 
>   It is more complex than that, in that you would need to preserve state
> changes over the length of the program, and PVM/MPI/et al do not
> preserve this state information.  The folks at LANL had a fault tolerant
> MPI at one point, but I haven't heard much of it recently.
> 
> Joe