gerry.creager at tamu.edu
Mon Jun 11 21:55:02 PDT 2007
I've tried to stay out of this. Really, I have.
Craig Tierney wrote:
> Mark Hahn wrote:
>>> Sorry to start a flame war....
>> what part do you think was inflamed?
> It was when I was trying to say "Real codes have user-level
> checkpointing implemented and no code should ever run for 7
A number of my climate simulations will run for 7-10 days to get
century-long simulations to complete. I've run geodesy simulations that
ran for up to 17 days in the past. I like to think that my codes are
Real codes do have user-level checkpointing, though. And even better
codes can be restarted without a lot of user intervention by invoking a
run-time flag and going off for coffee.
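
A minimal sketch of what user-level checkpointing with a restart flag might look like. Everything named here (the `state.ckpt` file, the `--restart` flag, the toy "model step") is hypothetical and stands in for whatever a real model uses; it is not how WRF, ROMS, or any other code mentioned in this thread does it.

```python
import argparse
import os
import pickle
import tempfile

CHECKPOINT = "state.ckpt"  # hypothetical checkpoint file name


def save_checkpoint(state, path=CHECKPOINT):
    """Write the state to a temp file, then rename it over the old checkpoint.

    The rename is atomic on POSIX, so a crash mid-write cannot leave a
    truncated checkpoint in place of a good one.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path=CHECKPOINT):
    with open(path, "rb") as f:
        return pickle.load(f)


def run(total_steps, checkpoint_every, restart=False, path=CHECKPOINT,
        stop_after=None):
    """Advance a toy model; `stop_after` simulates a crash for testing."""
    if restart and os.path.exists(path):
        state = load_checkpoint(path)
    else:
        state = {"step": 0, "value": 0.0}
    while state["step"] < total_steps:
        state["value"] += 0.5 * state["step"]  # stand-in for one time step
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
        if stop_after is not None and state["step"] >= stop_after:
            break
    return state


def main(argv=None):
    parser = argparse.ArgumentParser(description="toy checkpoint/restart demo")
    parser.add_argument("--restart", action="store_true",
                        help="resume from the last checkpoint if one exists")
    args = parser.parse_args(argv)
    final = run(total_steps=20, checkpoint_every=5, restart=args.restart)
    print("finished at step", final["step"], "value", final["value"])
```

Calling `main(["--restart"])`, or wiring `main()` to the script entry point, gives the "invoke a flag and go for coffee" behaviour: the run picks up at the last saved step instead of step 0. A run interrupted and restarted this way should produce exactly the same final state as an uninterrupted one, which is worth checking before trusting it.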
>>> Make sure that your code generates the exact same answer with
>>> debug/backtrace enabled and disabled,
>> part of the point of my very simple backtrace.so is that it has zero
>> runtime overhead and doesn't require any special compilation.
> Does the Intel version have overhead? I never measured it before,
> but I never thought it was much.
abandoned it and never tried its traceback or checkpointing
capabilities. PGI, which I do use, and old IBM Fort-G and Fort-H did
have overhead issues. The PGI compiler is what I tend to use almost all
the time for my model compiling so I'm not able to speak to much of this
new-fangled language stuff you're talking about :-)
>>> then you add user-level checkpointing so that you can
>> I'm most curious to hear people's experience with checkpointing.
>> all our more serious, established codes do checkpointing, but it's
>> extremely foreign to people writing newish codes.
>> and, of course, it's a lot of extra work. I'm not arguing against
>> checkpointing, just acknowledging that although we _require_ it,
>> we don't actually demand "proof-of-checkpointability".
> I included checkpointing in an ocean-model once. It was very easy,
> but that was most likely because of how it was organized (Fortran 77,
> most data structures were shared).
> I don't think that it is foreign to people writing new codes.
> It is foreign to scientists. Software developers (who could be
> scientists) would think of this from the beginning (I hope).
Let's see. WRF and MM5, on the atmospheric front, support user-level
checkpointing and restart capabilities. So do ADCIRC and Wave
Watch-III. And ROMS. So, the oceans side is covered. The older *nix
version of PAGES (geodesy) didn't but it was easily added. Most folks
didn't use PAGES like I did, and thus checkpointing was pretty useless.
I'm not dabbling in genomics or protein folding but most of the folks
I know who are, are computer scientists who "followed the money" and are
collaborating on projects with discipline scientists, implementing code
to support the "real" work. So, I strongly suspect they're implementing
>>> restart where you want. Then you
>>> run up until the problem and restart with the last checkpoint.
>> restarting from checkpoint is fine (the code in question could
>> actually do it), but still means you have hours of running,
>> presumably under a debugger.
>>> Run for a week without checkpointing? Just begging for trouble.
>> suppose you have 2k users, with ~300 active at any instant,
>> and probably 200 unrelated codes running. while we do require
>> checkpointing (I usually say "every 6-8 cpu hours"), I suspect that
>> many users never do. how do you check/validate/encourage/support
> Set your queue maximums to 6-8 hours. Prevents system hogging,
> encourages checkpointing for long runs. Make sure your IO system
> can support the checkpointing because it can create a lot of load.
And how do you support my operational requirements with this policy
during hurricane season? Let's see... "Stop that ensemble run now so
the Monte Carlo chemists can play for a while, then we'll let you back
on. Don't worry about the timeliness of your simulations. No one needs
a 35-member ensemble for statistical forecasting, anyway." Did I miss
Yeah, we really do that. With boundary-condition munging we can run a
statistical set of simulations and see what the probabilities are and
where, for instance, maximum storm surge is likely to go. If we don't
get sufficient membership in the ensemble, the statistical strength of
the forecasting procedure decreases.
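
To put a rough number on why membership matters, here is a sketch that estimates the probability of peak surge exceeding a threshold from the ensemble members, along with the usual binomial standard error sqrt(p(1-p)/N). It assumes the members can be treated as independent samples, which is an idealization, and the function name and inputs are made up for illustration.

```python
import math


def exceedance_probability(peak_surges_m, threshold_m):
    """Return (p, standard_error) for P(peak surge > threshold).

    peak_surges_m holds one peak surge value per ensemble member.
    The standard error assumes independent members (an idealization).
    """
    n = len(peak_surges_m)
    if n == 0:
        raise ValueError("need at least one ensemble member")
    hits = sum(1 for surge in peak_surges_m if surge > threshold_m)
    p = hits / n
    stderr = math.sqrt(p * (1.0 - p) / n)
    return p, stderr
```

With the same 40% of members over the threshold, a 35-member ensemble gives a standard error of about 0.08, while a 10-member ensemble gives about 0.15, so cutting membership roughly doubles the uncertainty on the probability being forecast.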
>> part of the reason I got a kick out of this simple backtrace.so
>> is indeed that it's quite possible to conceive of a checkpoint.so
>> which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly decent
>> job of checkpointing at least serial codes non-intrusively.
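
As a sketch of the first step such a checkpoint.so would need, the following reads a process's open file descriptors from /proc/$pid/fd and its memory regions from /proc/$pid/maps. It is Linux-only and only gathers the metadata; actually saving the writable regions and restoring them is the hard part and is not shown. The function name is hypothetical, and this is not the backtrace.so code discussed above.

```python
import os


def snapshot_process(pid="self"):
    """Collect open fds and mapped regions for a process via /proc (Linux)."""
    base = f"/proc/{pid}"

    # Each entry in /proc/<pid>/fd is a symlink to whatever the fd refers to.
    fds = {}
    for name in os.listdir(f"{base}/fd"):
        try:
            fds[int(name)] = os.readlink(f"{base}/fd/{name}")
        except OSError:
            continue  # fd was closed between listdir() and readlink()

    # Each line of /proc/<pid>/maps reads:
    #   start-end perms offset dev inode [pathname]
    regions = []
    with open(f"{base}/maps") as f:
        for line in f:
            fields = line.split(maxsplit=5)
            start, end = (int(x, 16) for x in fields[0].split("-"))
            regions.append({
                "start": start,
                "end": end,
                "perms": fields[1],
                "path": fields[5].strip() if len(fields) > 5 else "",
            })
    return fds, regions
```

A checkpointer would then most likely walk `regions`, keep the ones whose `perms` include `w`, and record file offsets for the regular files in `fds`. Sockets and pipes show up as `socket:[...]` and `pipe:[...]` targets, which is one reason this approach is more plausible for serial codes than for MPI jobs.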
> BTW, I like your code. I had a script written for me in the past
> (by Greg Lindahl in a galaxy far-far away). The one modification
> I would make is to print out the MPI ID environment variable (MPI
> flavors vary in how it is set). Then when it crashes, you know which
> process actually died.
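
A sketch of that modification: look through the environment for a rank variable and include it in whatever gets printed at crash time. The variable names below are ones set by current launchers as far as I know (Open MPI, MPICH-style PMI, PMIx, MVAPICH2, Slurm); older or other MPI flavors use different names, so treat the list as an assumption to extend. The function names are hypothetical.

```python
import os
import socket

# Rank variables set by some common launchers; extend for your MPI flavor.
RANK_VARS = (
    "OMPI_COMM_WORLD_RANK",  # Open MPI
    "PMI_RANK",              # MPICH-style PMI launchers
    "PMIX_RANK",             # PMIx-based launchers
    "MV2_COMM_WORLD_RANK",   # MVAPICH2
    "SLURM_PROCID",          # Slurm srun
)


def mpi_rank_from_env(environ=None):
    """Return (variable_name, value) for the first rank variable found."""
    environ = os.environ if environ is None else environ
    for name in RANK_VARS:
        value = environ.get(name)
        if value is not None:
            return name, value
    return None


def crash_banner(environ=None):
    """Build a one-line identifier to print alongside a backtrace."""
    found = mpi_rank_from_env(environ)
    rank = f"rank {found[1]} (from {found[0]})" if found else "rank unknown"
    return f"{socket.gethostname()} pid {os.getpid()} {rank}"
```

Printing `crash_banner()` at the top of the backtrace output ties each trace to a rank, host, and pid, so when one process in a large job dies you can tell which one it was.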
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843