rules at bellsouth.net
Mon Jan 6 08:31:42 PST 2003
Hello again, Donald/gang.
On Sun, 2003-01-05 at 15:13, Donald Becker wrote:
> On 5 Jan 2003, Randall Jouett wrote:
> > On Sat, 2003-01-04 at 11:58, Donald Becker wrote:
> > >
> > > Not at all! MPI does not handle faults. Most MPI applications just
> > > fail when a node fails. A few periodically write checkpoint files, and
> > > a subset ;-) of those can be re-run from the last checkpoint.
> > Checkpoint files? BLAH!!! :^). Admittedly, I'm a total neophyte
> Application-specific checkpoint files are sometimes the only effective
> way to handle node crashes.
> > Off the top of my head, why couldn't you just plug in an old
> > 10Base-T card to each node. Add a server node that specifically
> The problem isn't just detecting that a node has failed (which is either
> trivial or impossible, depending on your criteria), the problems are
> - handling a system failure during a multiple-day run.
> - handling partially completed work issued to a node
> - killing processes/nodes that you think have failed, lest they
> complete their work later.
Ah. Ok. I understand now. Thanks for the info.
> > BTW, has anyone bothered to calculate all the wasted cycles
> > used up by check-point files? :^).
> Checkpointing is very expensive, and most of the time the checkpoint
> isn't used. This is why only application-specific checkpointing makes
> sense: the application writer knows which information is critical, and
> when everything is consistent. Machines that save the entire memory
> space have been known to take the better part of an hour to roll out a
> large job.
An hour? Dang.
> > > Although the program completes the rendering, there is still much
> > > ugliness when a partially-failed MPI program tries to finish.
> > Hmmm. Why aren't folks flagging the node as dead and ignoring
> > any other output until the node is back up and saying it's
> > ready to run. This would have to be verified by the sysadmin,
> > of course.
> The issue is the internal structure of the MPI implementation: there is
> no way to say "I'm exiting successfully even though I know some processes
> might be dead." Instead what happens is that the library call waits
> around for the dead children to return.
I take it that you're talking about a compute node when you're saying
all of this, and I'm also reading processes here as "the other nodes."
Remember, I'm a neophyte, Donald :^).
Anywho, I was thinking that the lib call was written in an asynchronous
fashion, with various flags being set on the root node when a compute
node completed its computation. Also, the only way the root would
continue on with the application is when all nodes sent a response
saying that they're done.
I also don't see why you couldn't make a few test runs, average
out the response time of each node and the overall process (if
necessary), and stick this info into a database for a given app
on the root node. (Off the top of my head, of course. I'll have to think
about this a while.) To me, this seems like you'd be adding a certain
level of fault tolerance at the software level.
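
To make the test-run idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the node names, the sample timings, the `slack` multiplier, and the plain dict standing in for the database on the root node.

```python
import statistics

def response_window(samples, slack=3.0):
    """Return (mean, deadline) in seconds for one node's test-run timings.

    `slack` is a hypothetical tuning knob: how many standard deviations
    past the mean a node may run before the root calls it late.
    """
    mean = statistics.mean(samples)
    spread = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return mean, mean + slack * spread

# Made-up timings (seconds) from a few test runs, keyed by node name.
test_runs = {
    "node1": [10.0, 12.0, 11.0],
    "node2": [20.0, 20.0, 20.0],
}

# A dict stands in for the per-application database kept on the root node.
windows = {node: response_window(times) for node, times in test_runs.items()}
```

A node whose timings never vary gets a deadline equal to its mean, which is probably too strict in practice; a real setup would likely want some minimum slack on top.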
Now, if we set up a response-time window for an individual compute node
and the root node thinks that it has fallen out of the window, then it
seems to me that the root node could flag the node as having temporary
problems, and then it could shift that node's work over to the first node
that has completed its calculations/processing. Should the problematic
node straighten itself out and start responding again -- let's say it
finished its processing -- then the data is taken from that node
verbatim and stored off to disk. That is, if the recovery node didn't
finish its work already, of course. You'd also have to tell the original
node that straightened itself out "Never mind," of course. (Said with
Lily Tomlin intonations :^) ).
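
The bookkeeping for that could look something like the toy sketch below. It is not how any MPI library works; the `Root` class and its method names are hypothetical, and it assumes every work unit is issued at time zero so that "now" is simply seconds since the run started.

```python
class Root:
    """Toy root-node bookkeeping for the response-window idea."""

    def __init__(self, windows):
        self.windows = dict(windows)  # node -> deadline in seconds
        self.owner = {}               # work unit -> node it was first issued to
        self.backup = {}              # work unit -> recovery node, if one was assigned
        self.results = {}             # work unit -> (node, data); first finisher wins
        self.idle = []                # nodes that have reported, in completion order

    def assign(self, unit, node):
        self.owner[unit] = node

    def tick(self, now):
        """Hand every overdue, unfinished unit to the first idle node."""
        for unit, node in self.owner.items():
            overdue = now > self.windows[node]
            if (overdue and unit not in self.results
                    and unit not in self.backup and self.idle):
                self.backup[unit] = self.idle.pop(0)

    def report(self, unit, node, data):
        """Store a result; return the node that should be told "never mind"."""
        if unit in self.results:
            return None               # late duplicate, ignore it
        self.results[unit] = (node, data)
        self.idle.append(node)
        rivals = {self.owner[unit], self.backup.get(unit)} - {node, None}
        return rivals.pop() if rivals else None
```

Whichever node reports first has its data kept verbatim, and the return value names the other node working on the same unit, if there is one. This sketch does not put the cancelled node back in the idle pool, and it says nothing about how to actually stop work on a remote machine, which is the hard part Donald pointed out.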
> This brings us back to the liveness test for compute nodes. When do we
> decide that a node has failed? If it doesn't respond for a second? A
> transient Ethernet failure might take up to three seconds to restart
> link (typical is 10 msec.) Thirty seconds?
If it gets out of its window, you set things up so that the first
node to complete its computations takes over its work load. If the node
acting up straightens itself out, great. Then you just kill the
request for node recovery and things should just "keep on trucking."
(Wow. That last remark is really showing my age :^)). At this point,
I guess you'd also want to increase the size of the window on the
node that's acting up, too. Also, if the node doesn't respond
within twice the window size, I guess you could display a message
on the console, remove the node from computations, and let the
sysadmin take a look at the machine to see if anything is awry
with the node or the network. More than likely, hiccups would involve
latency, possibly due to fragmentation or the like. God forbid a memory
If you really wanted to get spiffy, I guess you could work a
neural net into the system, having it monitor network traffic
and such. A setup like this might be able to warn you if glitches
were getting ready to rear their ugly heads. :^)
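
Neural net aside, the plain escalation policy (reassign once the window is missed, pull the node once twice the window has passed, widen the window if the node recovers) might be sketched as below. The function names, the state labels, and the 1.5 growth factor are all assumptions made for illustration.

```python
def classify(elapsed, window):
    """Decide what the root does about a node, given seconds since work was issued."""
    if elapsed <= window:
        return "ok"        # still inside its response window
    if elapsed <= 2 * window:
        return "reassign"  # late: give its work to the first idle node
    return "remove"        # past twice the window: warn the console, drop the node

def widen(window, factor=1.5):
    """Grow the window for a node that was late but recovered (factor is a guess)."""
    return window * factor

# A node with a 10-second window, checked at three different moments.
states = [classify(t, 10.0) for t in (8.0, 15.0, 25.0)]
```

With these inputs the three checks come out as "ok", "reassign", and "remove" in that order.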
> A machine running a 2.4 kernel before 2.4.17 might take minutes to
> respond when recovering from a temporary memory shortage, but run just
> fine later.
In the model I just described, I don't think this would be a problem.
(Shrug.) I still have to think about all of this, of course. The one
thing I really like about a model like this is that it would be
asynchronous, and you could get away with simplistic levels of
message passing. Just open a socket, read packets, and write packets.
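
As a rough illustration of how simple the socket side could be, the sketch below has the root wait on a socket for a node's "done" message and give up once the window runs out. `wait_for_done` is a hypothetical helper, and `socket.socketpair()` stands in for a real network connection so the example runs on one machine.

```python
import socket

def wait_for_done(conn, window):
    """Return the node's message, or None if nothing arrives within `window` seconds."""
    conn.settimeout(window)
    try:
        data = conn.recv(1024)
    except socket.timeout:
        return None
    return data or None  # empty bytes means the other end closed the connection

# A connected pair of sockets plays the part of root and compute node.
root_end, node_end = socket.socketpair()
node_end.sendall(b"done")

first = wait_for_done(root_end, 0.5)    # the message is already there
second = wait_for_done(root_end, 0.05)  # nothing more was sent, so this times out

node_end.close()
root_end.close()
```

A timeout here only says the node was quiet for that long. It cannot tell a dead node from a slow one, which is exactly the liveness question raised above.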
BTW, I have read and understood everything you've said, Donald,
and I thank you wholeheartedly for the explanation. The way I
responded, though, you'd think that I already knew what I was
talking about. Without any doubt -- I don't! :^). I wrote my
response this way, though, so that you and others can straighten
me out if I'm looking at parallel processing in a bass-ackwards
fashion. If not, then maybe the model I described is exactly how
things work already. If not, then maybe it might be worth looking
into further. (Shrug.)
Ok. I've been up all night with the flu. I need to try to get
some sleep, and answer the rest of my Beowulf e-mails later
on tonight, if I'm feeling better.
73 (Best Regards in morse code...ham lingo :^),
Amateur Radio: AB5NI