j.c.burton at gats-inc.com
Mon Jan 6 07:36:27 PST 2003
Randall Jouett wrote:
> On Sat, 2003-01-04 at 11:58, Donald Becker wrote:
>>Not at all! MPI does not handle faults. Most MPI applications just
>>fail when a node fails. A few periodically write checkpoint files, and
>>a subset ;-) of those can be re-run from the last checkpoint.
> Checkpoint files? BLAH!!! :^). Admittedly, I'm a total neophyte
> when it comes to parallel processing and the beowulf architecture,
> but computin is computin, and I think I might have a "el-cheapo,
> ham-operator solution." (Hams are infamous for being TOTAL cheapskates
Ummm...'all "computin" ain't equal'. While checkpoint files might not be
useful for what you do, they save thousands of machine- and man-hours in
my business. We record gigabytes of raw satellite data per day.
Processing a day's worth of data takes 2 days on a 2.5 GHz P4, so we
divide the data into orbits and process the orbits in parallel. The
mathematical model is such that fine-grained parallel processing is not
practical at this time (it would take a massive redesign, and the
scientists don't understand parallel). If a process dies, we can go
back to the logs, correct the problem, and restart from the last
checkpoint (which was a minute or so ago) instead of starting over at
the beginning, which could be as much as 24 hours ago...
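
For anyone who has not seen the pattern, here is a minimal sketch of
checkpoint/restart for a single orbit, in Python. It is not our
production code: the names (process_orbit, save_checkpoint,
load_checkpoint), the JSON state layout, and the "every" interval are
all made up for illustration, and the running sum stands in for the
real processing.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint.

    The rename is atomic, so a crash in the middle of a write never
    damages the last good checkpoint.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_record": 0, "total": 0.0}


def process_orbit(records, path, every=100, fail_at=None):
    """Process records, resuming from the checkpoint at `path` if present.

    `fail_at` is a hypothetical hook that simulates a crash at that index.
    """
    state = load_checkpoint(path)
    for i in range(state["next_record"], len(records)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated crash")
        state["total"] += records[i]  # stand-in for the real science code
        state["next_record"] = i + 1
        if state["next_record"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If the process dies at record 550 with every=100, the next run picks up
at record 500 and redoes only 50 records rather than all of them.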
> Off the top of my head, why couldn't you just plug in an old
> 10Base-T card to each node. Add a server node that specifically
> polls each machine via hardware latch and software response.
> Just a quick, "Hey, I'm still here." This fault server would
> then send the root/head node a quick "we're running, boss!"
> message, or it would tell the root/head node that a particular
> machine was down. If the root machine sees a fault message,
> it parses the packet, ignores the broken node, then reschedules
> the task for execution. It could also send an e-mail to the
> sysadmin, page him, and even play a "RED ALERT!" sample from
> Trek :^).
Apparently you are not current on cluster technology, or you wouldn't be
proposing something that is common knowledge.
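
For readers new to it, the polling scheme quoted above boils down to
roughly the following Python sketch. It is only an illustration, not
any particular cluster package: HeartbeatMonitor, reschedule, the node
names, and the 5-second timeout are all hypothetical, and heartbeats
are passed in as plain timestamps instead of arriving over a network.

```python
import time


class HeartbeatMonitor:
    """Track the last time each node said "I'm still here"."""

    def __init__(self, nodes, timeout=5.0):
        self.timeout = timeout
        self.last_seen = {node: None for node in nodes}

    def beat(self, node, now=None):
        """Record a heartbeat from `node` (uses the monotonic clock by default)."""
        self.last_seen[node] = time.monotonic() if now is None else now

    def down_nodes(self, now=None):
        """Return nodes that never reported or have been silent too long."""
        now = time.monotonic() if now is None else now
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if seen is None or now - seen > self.timeout
        )


def reschedule(assignments, down, live):
    """Move tasks off down nodes onto live nodes, round-robin.

    `assignments` maps task name -> node name; a new mapping is returned.
    """
    if not live:
        raise RuntimeError("no live nodes left")
    new_assignments = dict(assignments)
    k = 0
    for task in sorted(assignments):
        if assignments[task] in down:
            new_assignments[task] = live[k % len(live)]
            k += 1
    return new_assignments
```

Note that this only tells the head node where to send the task again.
The task itself still starts from wherever its last saved state is.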
> Now, if you REALLY wanted to be cheap :^), you could do something
> like this with a USB hub, although I'm pretty sure it wouldn't
> be as fast as the 10Base-T setup. OTOH, 10Base-T gear (e.g. hub,
> switch, NICs) can probably be had for the asking at most
> institutions, I'd imagine.
10Base-T is too slow for a typical parallel application. Switched
100Base-T is almost as inexpensive.
> BTW, has anyone bothered to calculate all the wasted cycles
> used up by check-point files? :^).
Yup, and it is significantly less than the number of cycles that would
be wasted having to rerun 24 hours' worth of processing because a machine
hiccuped and the process died...
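
A back-of-the-envelope version of that comparison, as a Python sketch.
The 24-hour run and the one-minute checkpoint interval are the figures
from this thread; the 1-second cost per checkpoint write is purely an
assumed number for illustration, and the model is deliberately crude
(one failure at the worst possible moment, no restart time).

```python
def worst_case_hours(run_h, interval_min=None, write_s=0.0, failures=1):
    """Worst-case wall-clock hours to finish a job of `run_h` hours.

    Without checkpoints (interval_min=None), each failure can cost a
    full rerun. With checkpoints, each failure costs at most one
    interval, plus the time spent writing every checkpoint.
    """
    if interval_min is None:
        return run_h + failures * run_h
    n_checkpoints = run_h * 60 / interval_min
    overhead_h = n_checkpoints * write_s / 3600
    lost_h = failures * interval_min / 60
    return run_h + overhead_h + lost_h


# Hypothetical comparison: 24 h job, 1 min interval, assumed 1 s per write.
without = worst_case_hours(24)
with_ckpt = worst_case_hours(24, interval_min=1, write_s=1.0)
print(f"no checkpoints: {without:.2f} h, with checkpoints: {with_ckpt:.2f} h")
```

Under those assumptions the checkpoint writes add about 0.4 hours to the
run, against a possible 24 hours of repeated work without them.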
> Randall Jouett
> Amateur Radio: AB5NI
> The model I mentioned does have its flaws, of course, such
> as a switch or hub going down, or maybe a busted CAT-5 cable
> here or there. Something tells me, though, that it HAS to be
> infinitely superior to check-point files and the like :^).
> That is, if I'm understanding your meaning here of check-point
> files. If I'm off base here, Donald, maybe you could clarify?
In my world a checkpoint file is a "snapshot" of the state of a running
process at a given time. This "snapshot" is complete enough to restart
the process from that point should it fail later.