rules at bellsouth.net
Tue Jan 7 10:22:03 PST 2003
On Mon, 2003-01-06 at 10:27, Robert G. Brown wrote:
> Don's point about MPI jobs should not be taken lightly. Suppose one is
> running a tightly coupled job (one where all the nodes advance
> "together" and where failure of any node and the state data it contains
> implies failure of the overall job) that will take one month to complete
> on 100 nodes. Let us further suppose that (not unreasonably) the
> probability that at least one node will "fail" and require at least a
> restart during that month is essentially unity.
> The time required to complete the project without checkpointing is
> basically infinity. The time required to complete the project with a
> checkpoint generated once a day, at the cost of 1/30'th of a day's work
> (close to an hour!) is likely to be about 31 (best of all worlds, no
> failure) and maybe 35 (1-3 failures) days, depending on the number of
> actual failures that occur and how rapidly you are able to repair the
> downed node(s) and restart the job.
> BIG difference between 35 days and infinity...hmmmm
You bet! :^). Thanks for the thorough explanation, Robert.
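
To put rough numbers on Robert's example, here is a minimal Python sketch of the checkpointing arithmetic. The function name, its parameters, and the one-day repair time are assumptions made for illustration, not anything from the original post. It also assumes that each failure loses, on average, half a checkpoint interval of work.

```python
def completion_days(work_days, checkpoint_cost_days, failures,
                    repair_days, interval_days=1.0):
    """Estimate wall-clock days for a checkpointed job (illustrative model).

    work_days            -- days of useful computation required
    checkpoint_cost_days -- time spent writing one checkpoint, in days
    failures             -- number of node failures during the run
    repair_days          -- assumed time to repair and restart per failure
    interval_days        -- useful work done between checkpoints
    """
    # One checkpoint is written per interval of useful work.
    checkpoint_overhead = (work_days / interval_days) * checkpoint_cost_days
    # Assumption: a failure strikes, on average, halfway through an interval,
    # so half an interval of work is redone, plus the repair/restart time.
    failure_cost = failures * (interval_days / 2.0 + repair_days)
    return work_days + checkpoint_overhead + failure_cost


if __name__ == "__main__":
    for n_failures in range(4):
        days = completion_days(30, 1 / 30, n_failures, repair_days=1.0)
        print(f"{n_failures} failure(s): about {days:.1f} days")
```

With 30 days of work and a daily checkpoint costing 1/30 of a day, this gives about 31 days with no failures and between 32.5 and 35.5 days for one to three failures, roughly in line with the figures quoted above. Without checkpoints, every failure would instead throw away all the work done so far.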
> This, BTW, is one of the reasons that there are relatively few WinXX
> clusters out there. At least some implementations and installations of
> WinXX (where XX is nearly any flavor you like) have reportedly had
> reliable uptimes on the order of a day under heavy load. If true, one
> would damn near have to checkpoint every fifteen minutes to get through
> the aforementioned computation at all and it would take a year. Even a
> single failure per day per 100 nodes is enough to significantly affect
> time of completion of synchronous tasks.
LOL. I love it. OTOH, I'm sure all of us know that WinXX has
always been a total piece of garbage and should have never,
ever won the OS war :^). With Red Hat "trying" to integrate
KDE and Gnome, though, Linux and the like may someday unseat
the all-powerful and unknowing MicroSloth :^). That is, with
a standardized GUI, I think Linux/Unix has a much better chance
of breaking into the biz world in a serious way. I also think that
Mac OS X is helping a bit in this arena, too.
> Without checkpointing (and a lot of folks do run without it as it IS
> often a PITA to implement) one is basically gambling that one's cluster
> will stay up through a computation cycle, and one sets one's
> computational cycle accordingly, making it a form of "checkpointing".
> Experience and arithmetic rapidly teach one when this is a good bet --
> and when it is not. The first time you run for a month, only to have a
> node (and the entire computation!) crash a few hours before completion
> when you were COUNTING on the results to complete the paper you're
> presenting at a conference the following week, the work to checkpoint may
> not seem so very much after all...;-)
That hasta suck :^).
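
The "gamble" Robert describes can be given a rough number. The sketch below assumes node failures are independent and exponentially distributed, which is a simplification. The function name and the one-year per-node MTBF are hypothetical values chosen for illustration and do not come from the thread.

```python
import math


def survival_probability(nodes, run_days, node_mtbf_days):
    """Chance that no node fails during an uncheckpointed run.

    Assumes independent, exponentially distributed failures, so the
    cluster as a whole fails at a rate of nodes / node_mtbf_days per day.
    """
    return math.exp(-nodes * run_days / node_mtbf_days)


if __name__ == "__main__":
    # Hypothetical per-node MTBF of one year (an assumption, not measured).
    mtbf = 365.0
    for days in (1, 7, 30):
        p = survival_probability(100, days, mtbf)
        print(f"{days:2d}-day run on 100 nodes: P(no failure) = {p:.4f}")
```

With 100 nodes, a 30-day run, and that assumed MTBF, the chance of finishing without a single failure is well under 1%, which matches the "essentially unity" failure probability in Robert's example. A much shorter run on the same cluster is a far better bet.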
> Last remark: Randy, you very definitely should take the time to skim
> through the list archives, a book or two on parallel computing and
> beowulfery in general, and maybe the howtos or FAQs before making hard
> pronouncements on what does and doesn't make sense in cluster computing.
Well, if my statements were coming across this way, I most humbly
apologize to you and the list! Basically, I asked my original
questions so that I could find out what exactly people in the
real world were using their clusters for, hoping to use the
garnered information for research. Unfortunately, I somehow
got tied up in conversation on the list, answering this question
and that, making statements that are relevant in serial-based computing
and seeing if they could be tied to the parallel world in some way,
shape, or form.
> This is for a variety of reasons, and you should learn them. This is
> not intended as a flame, just as a suggestion.
No problemo, Robert! I'm rather thick skinned, so don't even
begin to worry about it. :^)
> Note the following Great Truths:
[ABC's snipped :^)]
OK. I'll shut up and do my homework, Robert :^).
I'll just answer a few more e-mails that were
posted to the list (mainly to complete my thoughts),
and then I'll be quiet and study :^). OTOH, one has
to admit that at least a few of my remarks have stimulated
list activity between members. Do I at least get a C+
for my random-number idea? :^)
Thanks for all the great input, Robert. Much appreciated!
Amateur Radio: AB5NI