Beowulf Questions
Randall Jouett rules at bellsouth.net
Tue Jan 7 10:22:03 PST 2003
On Mon, 2003-01-06 at 10:27, Robert G. Brown wrote:
> Don's point about MPI jobs should not be taken lightly. Suppose one is
> running a tightly coupled job (one where all the nodes advance
> "together" and where failure of any node and the state data it contains
> implies failure of the overall job) that will take one month to complete
> on 100 nodes. Let us further suppose that (not unreasonably) the
> probability that at least one node will "fail" and require at least a
> restart during that month is essentially unity.
>
> The time required to complete the project without checkpointing is
> basically infinity. The time required to complete the project with a
> checkpoint generated once a day, at the cost of 1/30'th of a day's work
> (close to an hour!) is likely to be about 31 (best of all worlds, no
> failure) and maybe 35 (1-3 failures) days, depending on the number of
> actual failures that occur and how rapidly you are able to repair the
> downed node(s) and restart the job.
>
> BIG difference between 35 days and infinity...hmmmm

You bet! :^). Thanks for the thorough explanation, Robert. Much appreciated!

> This, BTW, is one of the reasons that there are relatively few WinXX
> clusters out there. At least some implementations and installations of
> WinXX (where XX is nearly any flavor you like) have reportedly had
> reliable uptimes on the order of a day under heavy load. If true, one
> would damn near have to checkpoint every fifteen minutes to get through
> the aforementioned computation at all and it would take a year. Even a
> single failure per day per 100 nodes is enough to significantly affect
> time of completion of synchronous tasks.

LOL. I love it. OTOH, I'm sure all of us know that WinXX has always been a total piece of garbage and should have never, ever won the OS war :^). With Red Hat "trying" to integrate KDE and Gnome, though, Linux and the like may someday unseat the all-powerful and unknowing MicroSloth :^).
That is, with a standardized GUI, I think Linux/Unix has a much better chance of breaking into the biz world in a serious way. I also think that Mac OS X is helping a bit in this arena, too.

> Without checkpointing (and a lot of folks do run without it as it IS
> often a PITA to implement) one is basically gambling that one's cluster
> will stay up through a computation cycle, and one sets one's
> computational cycle accordingly, making it a form of "checkpointing".
> Experience and arithmetic rapidly teaches one when this is a good bet --
> and when it is not. The first time you run for a month, only to have a
> node (and the entire computation!) crash a few hours before completion
> when you were COUNTING on the results to complete the paper you're
> presenting at a conference the following week the work to checkpoint may
> not seem so very much after all...;-)

That hasta suck :^).

> Last remark: Randy, you very definitely should take the time to skim
> through the list archives, a book or two on parallel computing and
> beowulfery in general, and maybe the howtos or FAQs before making hard
> pronouncements on what does and doesn't make sense in cluster computing.

Well, if my statements were coming across this way, I most humbly apologize to you and the list! Basically, I asked my original questions so that I could find out what exactly people in the real world were using their clusters for, hoping to use the garnered information for research. Unfortunately, I somehow got tied up in conversation on the list, answering this question and that, making statements that are relevant in serial-based computing and seeing if they could be tied to the parallel world in some way, shape, or form.

> This is for a variety of reasons, and you should learn them. This is
> not intended as a flame, just as a suggestion.

No problemo, Robert! I'm rather thick skinned, so don't even begin to worry about it. :^)

> Note the following Great Truths:

[ABC's snipped :^)]

OK.
I'll shut up and do my homework, Robert :^). I'll just answer a few more e-mails that were posted to the list (mainly to complete my thoughts), and then I'll be quiet and study :^). OTOH, one has to admit that at least a few of my remarks have stimulated list activity between members. Do I at least get a C+ for my random-number idea? :^)

Thanks for all the great input, Robert. Much appreciated!

Randall
--
Randall Jouett
Amateur Radio: AB5NI
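
Robert's checkpoint arithmetic quoted above can be tried out numerically. What follows is only a rough sketch and was not posted to the list. It assumes failures arrive as a Poisson process with a cluster-wide mean time between failures, that a failure throws away all progress since the last checkpoint, and that each failure adds a fixed repair delay. The function names, the parameter names, and the 10-day MTBF in the demo are all hypothetical choices made for illustration. Under those assumptions the expected wall-clock time to get through one uninterrupted stretch of length s is (MTBF + repair) * (exp(s / MTBF) - 1).

```python
import math


def expected_segment_days(segment_days, mtbf_days, repair_days=0.0):
    """Expected wall-clock days to finish one stretch of work that must
    restart from scratch after any failure.

    Assumption: failures are Poisson with mean spacing mtbf_days, and each
    failure costs repair_days of downtime before the restart.
    """
    return (mtbf_days + repair_days) * math.expm1(segment_days / mtbf_days)


def expected_job_days(work_days, mtbf_days, checkpoint_interval=None,
                      checkpoint_cost=0.0, repair_days=0.0):
    """Expected wall-clock days for the whole job.

    With checkpoint_interval=None the job is one long segment (no
    checkpointing). Otherwise it is split into equal segments, each made of
    checkpoint_interval days of work plus checkpoint_cost days of overhead.
    """
    if checkpoint_interval is None:
        return expected_segment_days(work_days, mtbf_days, repair_days)
    n_segments = work_days / checkpoint_interval
    segment = checkpoint_interval + checkpoint_cost
    return n_segments * expected_segment_days(segment, mtbf_days, repair_days)


if __name__ == "__main__":
    # 30 days of work, daily checkpoint costing 1/30 of a day (from the post).
    # The 10-day cluster-wide MTBF is an assumed value, not from the post.
    with_ckpt = expected_job_days(30, mtbf_days=10, checkpoint_interval=1.0,
                                  checkpoint_cost=1 / 30)
    without_ckpt = expected_job_days(30, mtbf_days=10)
    print(f"with daily checkpoints: {with_ckpt:.1f} days")
    print(f"without checkpoints:    {without_ckpt:.1f} days")
```

With the assumed 10-day MTBF, the checkpointed run lands between Robert's 31 and 35 days, while the run without checkpoints comes out several times longer. Shrinking `mtbf_days` toward one day shows how quickly the unprotected case blows up.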
