Randall Jouett rules at bellsouth.net
Tue Jan 7 10:22:03 PST 2003


On Mon, 2003-01-06 at 10:27, Robert G. Brown wrote:
> Don's point about MPI jobs should not be taken lightly.  Suppose one is
> running a tightly coupled job (one where all the nodes advance
> "together" and where failure of any node and the state data it contains
> implies failure of the overall job) that will take one month to complete
> on 100 nodes.  Let us further suppose that (not unreasonably) the
> probability that at least one node will "fail" and require at least a
> restart during that month is essentially unity.
> 
> The time required to complete the project without checkpointing is
> basically infinity.  The time required to complete the project with a
> checkpoint generated once a day, at the cost of 1/30'th of a day's work
> (close to an hour!) is likely to be about 31 (best of all worlds, no
> failure) and maybe 35 (1-3 failures) days, depending on the number of
> actual failures that occur and how rapidly you are able to repair the
> downed node(s) and restart the job.
> 
> BIG difference between 35 days and infinity...hmmmm

You bet! :^). Thanks for the thorough explanation, Robert.
Much appreciated!
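
The arithmetic in the quoted paragraphs can be sketched in a few lines
of Python. This is only a rough model, not anything from Robert's
post: the function names, the 5% per-node monthly failure probability,
and the worst-case assumption that each failure throws away a full
checkpoint interval are all illustrative assumptions.

```python
def p_any_node_fails(nodes, p_node):
    """Probability that at least one of `nodes` independent nodes fails,
    given a per-node failure probability `p_node` over the run.
    Independence between nodes is an assumption."""
    return 1.0 - (1.0 - p_node) ** nodes


def completion_days(work_days=30.0, interval_days=1.0,
                    checkpoint_cost_days=1.0 / 30.0,
                    failures=0, repair_days=0.0):
    """Rough wall-clock estimate, in days, for a checkpointed job.

    Defaults follow the example in the post: 30 days of work, one
    checkpoint per day, each costing 1/30 of a day.
    """
    checkpoints = work_days / interval_days
    overhead = checkpoints * checkpoint_cost_days
    # Assumed worst case: every failure hits just before the next
    # checkpoint, so a whole interval is redone, plus repair time.
    lost = failures * (interval_days + repair_days)
    return work_days + overhead + lost


if __name__ == "__main__":
    print(f"P(at least one failure): {p_any_node_fails(100, 0.05):.3f}")
    print(f"No failures:    {completion_days():.1f} days")
    print(f"Three failures: {completion_days(failures=3):.1f} days")
```

With the defaults, the failure-free case comes out at about 31 days,
and three worst-case failures with no repair delay add about three
more. Allowing roughly a third of a day per repair lands near the 35
days mentioned above. Even a modest assumed per-node failure rate
pushes the chance of at least one failure in 100 nodes above 99%.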

 
> This, BTW, is one of the reasons that there are relatively few WinXX
> clusters out there.  At least some implementations and installations of
> WinXX (where XX is nearly any flavor you like) have reportedly had
> reliable uptimes on the order of a day under heavy load.  If true, one
> would damn near have to checkpoint every fifteen minutes to get through
> the aforementioned computation at all and it would take a year.  Even a
> single failure per day per 100 nodes is enough to significantly affect
> time of completion of synchronous tasks.

LOL. I love it. OTOH, I'm sure all of us know that WinXX has
always been a total piece of garbage and should have never,
ever won the OS war :^). With Red Hat "trying" to integrate 
KDE and Gnome, though, Linux and the like may someday unseat
the all-powerful and unknowing MicroSloth :^). That is, with
a standardized GUI, I think Linux/Unix has a much better chance
of breaking into the biz world in a serious way. I also think that
Mac OS X is helping a bit in this arena, too.



 
> Without checkpointing (and a lot of folks do run without it as it IS
> often a PITA to implement) one is basically gambling that one's cluster
> will stay up through a computation cycle, and one sets one's
> computational cycle accordingly, making it a form of "checkpointing".
> Experience and arithmetic rapidly teach one when this is a good bet --
> and when it is not.  The first time you run for a month, only to have a
> node (and the entire computation!) crash a few hours before completion
> when you were COUNTING on the results to complete the paper you're
> presenting at a conference the following week, the work to checkpoint
> may not seem so very much after all...;-)

That hasta suck :^).
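
The "gamble" of running without checkpoints can also be put into
numbers. The sketch below assumes that failures of the cluster as a
whole arrive at a constant rate (exponentially distributed time
between failures), that a failed run restarts from scratch, and that
restarting takes no time. Under those assumptions the expected time to
finish a run of length T with mean time between failures M is
M * (exp(T/M) - 1). The function names and the sample figures are
illustrative, not measurements from anyone's cluster.

```python
import math


def p_run_survives(run_days, mtbf_days):
    """Chance that an uninterrupted run of `run_days` completes,
    assuming exponentially distributed cluster-wide failures with
    mean time between failures `mtbf_days`."""
    return math.exp(-run_days / mtbf_days)


def expected_days_without_checkpoint(run_days, mtbf_days):
    """Expected wall-clock days to finish when every failure forces a
    restart from the beginning (instant restart assumed)."""
    return mtbf_days * math.expm1(run_days / mtbf_days)


if __name__ == "__main__":
    # Hypothetical cluster-wide MTBF of 10 days.
    for run in (1, 10, 30):
        p = p_run_survives(run, 10.0)
        t = expected_days_without_checkpoint(run, 10.0)
        print(f"{run:2d}-day run: P(success) = {p:.3f}, expected {t:.1f} days")
```

Under these assumptions a run that is short compared with the MTBF
costs almost nothing extra, while a run of three times the MTBF
succeeds on any given attempt only about 5% of the time and is
expected to take several times its nominal length.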
 
> Last remark:  Randy, you very definitely should take the time to skim
> through the list archives, a book or two on parallel computing and
> beowulfery in general, and maybe the howtos or FAQs before making hard
> pronouncements on what does and doesn't make sense in cluster computing.

Well, if my statements were coming across this way, I most humbly
apologize to you and the list! Basically, I asked my original
questions so that I could find out what exactly people in the
real world were using their clusters for, hoping to use the
garnered information for research.  Unfortunately, I somehow
got tied up in conversation on the list, answering this question
and that, making statements that are relevant in serial-based computing
and seeing if they could be tied to the parallel world in some way,
shape, or form.
 

> This is for a variety of reasons, and you should learn them.  This is
> not intended as a flame, just as a suggestion.

No problemo, Robert! I'm rather thick skinned, so don't even
begin to worry about it. :^)


> Note the following Great Truths:

[ABC's snipped :^)]

OK. I'll shut up and do my homework, Robert :^).
I'll just answer a few more e-mails that were
posted to the list (mainly to complete my thoughts),
and then I'll be quiet and study :^). OTOH, one has
to admit that at least a few of my remarks have stimulated
list activity between members. Do I at least get a C+
for my random-number idea? :^)


Thanks for all the great input, Robert. Much appreciated!

Randall

--
Randall Jouett
Amateur Radio: AB5NI




