hahn at physics.mcmaster.ca
Mon Jan 6 16:33:23 PST 2003
> Anywho, I was thinking that the lib call was written in an asynchronous
> fashion, with various flags being set on the root node when a compute
> node completed its computation. Also, the only way the root would
> continue on with the application is when all nodes sent a response
> saying that they're done.
well, that means the master becomes a potential bottleneck.
also consider what happens if the master fails...
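to make the concern concrete, here is a minimal sketch of the "flags on the root" scheme as i read it from the quoted text. the function name `root_wait` and the idea of modelling reports as a plain list of node ids are my own assumptions, not anything from a real library. the point is only that every report funnels through one place, and that the root cannot proceed while any flag is still unset.

```python
def root_wait(expected, reports):
    """Hypothetical root-node bookkeeping: one 'done' flag per compute node.

    expected: iterable of node ids the root is waiting on (assumed naming).
    reports:  node ids in the order their completion messages arrived.
    Returns (all_done, missing_nodes).
    """
    done = {node: False for node in expected}
    for node in reports:           # every report passes through the root
        if node in done:
            done[node] = True
    missing = sorted(n for n, ok in done.items() if not ok)
    return (not missing), missing


if __name__ == "__main__":
    # nodes 1 and 3 never report, so the root would be stuck waiting on them
    print(root_wait([0, 1, 2, 3], [2, 0]))
```

nothing in this sketch covers the root itself dying; the flags live only on the root, so they die with it.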
> verbatim and stored off to disk. That is, if the recovery node didn't
> finish its work already, of course. You'd also have to tell the original
> node that straightened itself out "Never mind," of course. (Said with
that's fine if each node has only trivial globally unique state.
but often, the reason you're using parallelism at all is because
you have a huge amount of global state, and each of N nodes owns 1/N of it.
can your program somehow survive when 1/N of its state disappears?
some codes don't have a lot of state. for instance, suppose you were
doing password cracking - a node's state is just its assigned subspace
within the set of possible cleartext passwords. if it dies, just
hand the space to some other node or distribute it among the survivors.
if your problem is like that, you're utterly and completely golden -
not only can you handle failures easily, but you can also run just
fine on a grid. like prime-cracking, seti@home, etc.
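a rough sketch of the "distribute it among the survivors" case, assuming the cleartext space can be numbered as a half-open integer range and that a node's entire state is its list of assigned ranges. the names `split_range` and `reassign` are made up for illustration, and the sketch simply re-does the dead node's whole range rather than tracking how far it got.

```python
def split_range(start, stop, parts):
    """Split [start, stop) into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(stop - start, parts)
    chunks = []
    lo = start
    for i in range(parts):
        hi = lo + size + (1 if i < extra else 0)
        chunks.append((lo, hi))
        lo = hi
    return chunks


def reassign(assignments, dead):
    """Hand a dead node's ranges to the surviving nodes.

    assignments: dict mapping node id -> list of (lo, hi) ranges (assumed layout).
    dead:        id of the node that failed.
    """
    orphaned = assignments.pop(dead)
    survivors = sorted(assignments)
    for lo, hi in orphaned:
        pieces = split_range(lo, hi, len(survivors))
        for node, (a, b) in zip(survivors, pieces):
            if a < b:                      # skip empty pieces
                assignments[node].append((a, b))
    return assignments


if __name__ == "__main__":
    # 4 nodes share a keyspace of 1000 candidates; node 2 dies
    work = {n: [c] for n, c in enumerate(split_range(0, 1000, 4))}
    print(reassign(work, dead=2))
```

no other node needs to know anything beyond the dead node's range, which is what makes this kind of problem so easy to recover.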