Beowulf Questions

Randall Jouett rules at bellsouth.net
Tue Jan 7 10:21:58 PST 2003


Howdy Mark, and thanks for the reply.

On Mon, 2003-01-06 at 18:33, Mark Hahn wrote:
> > Anywho, I was thinking that the lib call was written in an asynchronous
> > fashion, with various flags being set on the root node when a compute
> > node completed its computation. Also, the only way the root would
> > continue on with the application is when all nodes sent a response
> > saying that they're done.
> 
> well, that means the master becomes a potential bottleneck.


Hmmm. I think this would be application-specific and really depend
on the situation at hand. That is, I can see where some applications
would only need to worry about the data they're crunching internally,
and they wouldn't have to talk to other nodes beyond letting the root
node know that they've completed. For other applications where nodes
have to communicate with each other, it would seem that the model could
just be duplicated on each compute node, including the windowing and
statistics. Also, a compute node that's having trouble talking to another
compute node could report the problem to the root/head, making sure the
problem node is watchdogged or removed from scheduling completely should
it fail. That is, if I understand all of this parallel stuff correctly.
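
Just to make that bookkeeping concrete, here's a rough Python sketch of
what I picture the root/head doing -- the names are made up for
illustration, not from any real library:

# Hypothetical sketch of the root/head node's bookkeeping. Compute nodes
# report "done" or "trouble reaching a peer"; the root only moves on once
# every node still in the schedule has reported completion.

class RootTracker:
    def __init__(self, node_ids):
        self.in_schedule = set(node_ids)   # nodes eligible for work
        self.done = set()                  # nodes that reported completion

    def node_done(self, node_id):
        self.done.add(node_id)

    def peer_trouble(self, reporter, problem_node):
        # A compute node couldn't reach a peer: drop the problem node from
        # scheduling so the root doesn't sit waiting on it forever.
        print(f"{reporter} reports trouble reaching {problem_node}")
        self.in_schedule.discard(problem_node)
        self.done.discard(problem_node)

    def all_done(self):
        return self.in_schedule <= self.done

tracker = RootTracker(["node01", "node02", "node03"])
tracker.node_done("node01")
tracker.peer_trouble("node02", "node03")   # node03 pulled from the schedule
tracker.node_done("node02")
print(tracker.all_done())                  # True: everyone still scheduled is done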


> also consider what happens if the master fails...


No problem. The head/root node could be set up as a
failsafe cluster. Should the head node go down, another
machine just takes over. I think that would work,
anyway. (Shrug.)
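
Here's the kind of thing I'm picturing, as a toy Python sketch (the
timeout and names are made up, not any real HA package): a standby box
watches the head's heartbeat and promotes itself if the head goes quiet.

import time

HEARTBEAT_TIMEOUT = 10.0   # seconds of silence before the standby takes over

class Standby:
    def __init__(self):
        self.last_heartbeat = time.time()
        self.is_head = False

    def heartbeat_received(self):
        # Called whenever the current head checks in.
        self.last_heartbeat = time.time()

    def check(self):
        # Called periodically; promote ourselves if the head has gone silent.
        if not self.is_head and time.time() - self.last_heartbeat > HEARTBEAT_TIMEOUT:
            self.is_head = True
            print("Head node silent -- standby taking over as head/root.")

standby = Standby()
standby.check()   # nothing happens while the heartbeat is still fresh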
 
BTW, all these statements shouldn't be seen in an "I know
what the hell I'm talking about" context. Just brainstorming,
and the replies I'm getting on and off list seem to be helping
me understand all of this stuff, too. Thanks, guys!
 
> > verbatim and stored off to disk. That is, if the recovery node didn't
> > finish its work already, of course. You'd also have to tell the original
> > node that straightened itself out "Never mind," of course. (Said with
> 
> that's fine if each node has only trivial globally unique state.
> but often, the reason you're using parallelism at all is because 
> you have a huge amount of global state, and each of N nodes owns 1/N of it.
> can your program somehow survive when 1/N of its state disappears?


Personally, I'd think survivability would depend on the capabilities of
the people designing the code. I guess you could also set aside a node
or two and use them as failsafe/backup nodes (or whatever the terminology
is here), and should a node fail, one of them could take over. This all
depends on whether the node taking over would have access to the data.
(Checkpoint files? UGH! :^) ) BTW, when I say "UGH!" here, I'm really not
trashing checkpoint file usage. What I guess I'm really saying is that
they're a necessary evil, and something I'm sure all of us wish didn't
exist in any way, shape, or form. :^) I just hate seeing cycles being
used this way, and if someone could figure out a way to get rid of them,
I'm sure everyone in the field would be MUCH happier. :^)
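
For what it's worth, here's a bare-bones Python sketch of the checkpoint
idea (the file name and state layout are just made up): the working node
burns a few cycles writing its state to disk every so often, so a backup
node can pick up roughly where it left off instead of starting over.

import pickle

CHECKPOINT = "node03.ckpt"   # hypothetical per-node checkpoint file

def save_checkpoint(state):
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(state, f)

def restore_checkpoint():
    with open(CHECKPOINT, "rb") as f:
        return pickle.load(f)

# On the working node: checkpoint after every so many chunks of work.
state = {"next_index": 0, "partial_result": 0}
for i in range(100):
    state["partial_result"] += i
    state["next_index"] = i + 1
    if i % 25 == 0:
        save_checkpoint(state)

# On the backup node taking over: reload and continue from the last checkpoint.
resumed = restore_checkpoint()
print("resuming at index", resumed["next_index"])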

> some codes don't have a lot of state.  for instance, suppose you were
> doing password cracking - a node's state is just its assigned subspace
> within the set of possible cleartext passwords.  if it dies, just 
> hand the space to some other node or distribute it among the survivors.

Yep. Exactly what I was thinking.
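
Just to spell out that "hand the space to the survivors" idea in a few
lines of Python (the key space and node names are made up for the example):

def split_range(start, end, n):
    # Split [start, end) into n roughly equal [lo, hi) slices.
    size = end - start
    return [(start + (size * i) // n, start + (size * (i + 1)) // n)
            for i in range(n)]

# Initial assignment: 4 nodes, a search space of 1,000,000 candidate keys.
assignments = dict(zip(["n1", "n2", "n3", "n4"], split_range(0, 1_000_000, 4)))

# n3 dies: split its untouched slice among the survivors.
dead_lo, dead_hi = assignments.pop("n3")
survivors = list(assignments)
for node, piece in zip(survivors, split_range(dead_lo, dead_hi, len(survivors))):
    assignments[node] = [assignments[node], piece]   # node now owns two slices

print(assignments)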

 
> if your problem is like that, you're utterly and completely golden - 
> not only can you handle failures easily, but you can also run just 
> fine on a grid.  like prime-cracking, seti at home, etc.

Yep :^).

Nice talking to you, Mark, and best regards,

Randall

--
Randall Jouett
Amateur Radio: AB5NI
