Beowulf Questions

Tue Jan 7 11:42:45 PST 2003

On Tue, Jan 07, 2003 at 12:22:12PM -0600, Randall Jouett wrote:

> With all kidding aside, I can see how (in some applications)
> check-point files are an absolute necessity. My only beef
> with the situation is that a large amount of time is being
> spent doing IO on a "maybe." I do, however, see how they
> can be useful.

Most people don't waste a large amount of time. What they do is compare
the average loss of computation due to a failure with the loss of
computation due to the extra I/O.

Example: My machine fails on average every 24 hours. It takes me 1
hour to checkpoint. Therefore if I checkpoint every 8 hours, the
average loss from a failure is 4 hours (half the checkpoint interval),
and I spend 3 hours out of every 24 doing I/O (three 1-hour
checkpoints).

That's an ASCI-class example; most small clusters only need a few
minutes to checkpoint and have a failure every month.
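
To make the arithmetic concrete, here is a rough sketch of that
comparison in Python. The function names are made up for illustration,
and the model is the simple one from the example: failures land, on
average, halfway through a checkpoint interval, and the time spent
writing checkpoints is not itself counted as time at risk. Minimizing
the sum of the two costs under that model gives an interval of about
sqrt(2 * checkpoint time * time between failures), which is only a
first-order estimate and not a tuned recommendation.

```python
import math


def checkpoint_overhead(mtbf_hours, checkpoint_hours, interval_hours):
    """Return (io_hours, expected_loss_hours) per failure window.

    Assumes one failure per mtbf_hours on average, landing halfway
    through a checkpoint interval. Illustrative model only.
    """
    checkpoints_per_window = mtbf_hours / interval_hours
    io_hours = checkpoints_per_window * checkpoint_hours
    expected_loss_hours = interval_hours / 2.0
    return io_hours, expected_loss_hours


def best_interval(mtbf_hours, checkpoint_hours):
    """Interval minimizing io_hours + expected_loss_hours in this model."""
    return math.sqrt(2.0 * checkpoint_hours * mtbf_hours)


# The example above: fail every 24 h, 1 h per checkpoint, every 8 h.
io, loss = checkpoint_overhead(24.0, 1.0, 8.0)
print(io, loss)  # 3.0 4.0

# Interval that balances the two costs for the same machine.
print(best_interval(24.0, 1.0))
```

For a small cluster you would plug in your own numbers, for instance a
checkpoint time of a few minutes and a time between failures of about
a month, both converted to hours.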

-- greg