becker at scyld.com
Sat Jan 4 09:58:51 PST 2003
On Sat, 4 Jan 2003, Randall Jouett wrote:
[[ Topic: POV-Ray modified to use BeoMPI. ]]
> >>> It completes the rendering even with crashed or slow nodes.
> >>Ah. So it redistributes the work, huh? Kewl.
> > Here we use knowledge about the application semantics to implement
> > failure tolerance. When we have idle workers and the rendering isn't
> > finished, we send some of the remaining work to the idle machine.
> Well, I hate to sound like a knothead here, Donald, and I don't
> mean to be rude, but isn't this a de facto setup and standard in
> a beowulf environment?? If not, what the hell are people thinking
> about? :^) :^). To me, this just seems like the logical way to
> write code, but what the heck do I know? :^)
Not at all! MPI does not handle faults. Most MPI applications just
fail when a node fails. A few periodically write checkpoint files, and
a subset ;-) of those can be re-run from the last checkpoint.
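To make the checkpoint approach concrete, here is a minimal sketch in Python. It is an illustration only, not code from any real MPI application: the function names, the JSON state layout and the simulated failure are all assumptions. It writes the checkpoint to a temporary file and renames it, so a crash in the middle of a write does not destroy the previous checkpoint.

```python
import json
import os


def save_checkpoint(path, state):
    """Write state to path without ever leaving a truncated file behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace swaps the new file in over the old one in a single step.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def run(path, num_steps, checkpoint_every=10, fail_at=None):
    """Hypothetical time-step loop: sums the step indices 0..num_steps-1.

    fail_at simulates a node failure just before that step is computed.
    """
    state = load_checkpoint(path)
    while state["step"] < num_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]   # stand-in for the real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

A run that dies at step 17 with checkpoint_every=10 restarts from step 10, so the work done in steps 10 through 16 is repeated. That is only safe if repeating those steps has no side effects.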
With the POV-Ray port I used application-specific knowledge and explicit
code to re-issue the work and handle duplicate results. You can use the
same idea (though the code has to be written per application) with other
MPI applications that don't have side effects within the time step.
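The re-issue and duplicate-handling idea can be sketched roughly as follows. This is a simulation in plain Python, not the POV-Ray/BeoMPI code, and it makes no MPI calls. The function name, the tick-based worker model and the "least-assigned tile first" rule are all assumptions made for illustration.

```python
from collections import deque


def render_with_reissue(num_tiles, worker_speeds, max_ticks=10_000):
    """Simulate a master handing tiles to workers, re-issuing unfinished ones.

    worker_speeds[i] is the number of ticks worker i needs per tile;
    None models a crashed worker that accepts a tile and never answers.
    Returns (results, duplicates): results maps each tile to the worker
    whose answer arrived first, and duplicates counts the late answers
    that were discarded.
    """
    pending = deque(range(num_tiles))   # tiles not yet handed to anyone
    busy = {}                           # worker -> (tile, finish tick or None)
    results = {}
    duplicates = 0

    for tick in range(max_ticks):
        # Collect answers from workers that finish on this tick.
        for worker, (tile, done_at) in list(busy.items()):
            if done_at is not None and done_at <= tick:
                del busy[worker]
                if tile in results:
                    duplicates += 1     # someone else already delivered it
                else:
                    results[tile] = worker
        if len(results) == num_tiles:
            return results, duplicates

        # Give every idle worker something to do.
        for worker, speed in enumerate(worker_speeds):
            if worker in busy:
                continue
            if pending:
                tile = pending.popleft()
            else:
                # No fresh work left: re-issue a tile that is still
                # outstanding, preferring the one with the fewest workers.
                counts = {}
                for t, _ in busy.values():
                    if t not in results:
                        counts[t] = counts.get(t, 0) + 1
                if not counts:
                    continue
                tile = min(counts, key=lambda t: (counts[t], t))
            busy[worker] = (tile, None if speed is None else tick + speed)

    raise RuntimeError("rendering did not finish within max_ticks")


if __name__ == "__main__":
    # Worker 1 has crashed and worker 2 is slow; the render still completes.
    results, duplicates = render_with_reissue(6, [1, None, 5])
    print(sorted(results), duplicates)
```

The master never has to detect that a worker has failed. A crashed node simply never reports back, and its tile ends up being rendered by whichever worker goes idle first. The cost is that a slow but healthy worker may deliver a result that has already arrived, so the master must be prepared to throw it away.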
Although the program completes the rendering, there is still much
ugliness when a partially-failed MPI program tries to finish.
Donald Becker becker at scyld.com
Scyld Computing Corporation http://www.scyld.com
410 Severn Ave. Suite 210 Scyld Beowulf cluster system
Annapolis MD 21403 410-990-9993