[Beowulf] backtraces

Mon Jun 11 21:55:02 PDT 2007

I've tried to stay out of this.  Really, I have.

Craig Tierney wrote:
> Mark Hahn wrote:
>>> Sorry to start a flame war....
>>
>> what part do you think was inflamed?
> 
> It was when I was trying to say "Real codes have user-level
> checkpointing implemented and no code should ever run for 7
> days."

A number of my climate runs take 7-10 days to complete a century-long 
simulation.  I've run geodesy simulations that ran for up to 17 days in 
the past.  I like to think that my codes are real enough!

Real codes do have user-level checkpointing, though.  And even better 
codes can be restarted without a lot of user intervention by invoking a 
run-time flag and going off for coffee.
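
For anyone who hasn't seen the pattern, here is a minimal sketch of 
user-level checkpointing with a restart flag.  It is not taken from any 
of the codes discussed in this thread; the toy "model" state, the pickle 
file format, and the flag names (--restart, --every, and so on) are all 
assumptions made for illustration.

```python
import argparse
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # a crash mid-write never clobbers the old checkpoint


def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)


def run(total_steps, checkpoint_path, checkpoint_every,
        restart=False, stop_after=None):
    """Toy time-stepping loop; 'value' stands in for the real model state."""
    if restart and os.path.exists(checkpoint_path):
        state = load_checkpoint(checkpoint_path)
    else:
        state = {"step": 0, "value": 0.0}
    while state["step"] < total_steps:
        state["value"] += 0.5 * (state["step"] + 1)  # placeholder "physics"
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(checkpoint_path, state)
        if stop_after is not None and state["step"] >= stop_after:
            break  # simulate the job being killed part-way through
    return state


def main(argv=None):
    # Hypothetical command line; the flag names are illustrative only.
    parser = argparse.ArgumentParser()
    parser.add_argument("--steps", type=int, default=100)
    parser.add_argument("--checkpoint", default="model.ckpt")
    parser.add_argument("--every", type=int, default=10)
    parser.add_argument("--restart", action="store_true")
    args = parser.parse_args(argv)
    return run(args.steps, args.checkpoint, args.every, restart=args.restart)
```

The point of the design is that a restarted run gives the same final 
state as an uninterrupted one, and that restarting takes nothing more 
than passing the flag.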

>>> Make sure that your code generates the exact same answer with 
>>> debug/backtrace enabled and disabled,
>>
>> part of the point of my very simple backtrace.so is that it has zero 
>> runtime overhead and doesn't require any special compilation.
>>
> 
> Does the Intel version have overhead?  I never measured it before,
> but I never thought it was much.

Can't speak to the Intel compiler: given their terms of use, I've 
abandoned it and never tried its traceback or checkpointing 
capabilities.  PGI, which I do use, and old IBM Fort-G and Fort-H did 
have overhead issues.  The PGI compiler is what I tend to use almost all 
the time for my model compiling so I'm not able to speak to most of this 
new-fangled language stuff you're talking about :-)

>>> then you add user-level checkpointing so that you can
>>
>> I'm most curious to hear people's experience with checkpointing.
>> all our more serious, established codes do checkpointing, but it's 
>> extremely foreign to people writing newish codes.
>> and, of course, it's a lot of extra work.  I'm not arguing against
>> checkpointing, just acknowledging that although we _require_ it,
>> we don't actually demand "proof-of-checkpointability".
>>
> 
> I included checkpointing in an ocean-model once.  It was very easy,
> but that was most likely because of how it was organized (Fortran 77,
> most data structures were shared).
> 
> I don't think that it is foreign to people writing new codes.
> It is foreign to scientists.  Software developers (who could be
> scientists) would think of this from the beginning (I hope).

Let's see.  WRF and MM5, on the atmospheric front, support user-level 
checkpointing and restart capabilities.  So do ADCIRC and Wave 
Watch-III.  And ROMS.  So, the oceans side is covered.  The older *nix 
version of PAGES (geodesy) didn't, but it was easily added.  Most folks 
didn't use PAGES like I did, and thus checkpointing was pretty useless 
to them.  I'm not dabbling in genomics or protein folding, but most of 
the folks I know who are, are computer scientists who "followed the 
money" and are collaborating on projects with discipline scientists, 
implementing code to support the "real" work.  So, I strongly suspect 
they're implementing checkpointing, too.

>>> restart where you want.  Then you
>>> run up until the problem and restart with the last checkpoint.
>>
>> restarting from checkpoint is fine (the code in question could
>> actually do it), but still means you have hours of running,
>> presumably under a debugger.
>>
>>> Run for a week without checkpointing?  Just begging for trouble.
>>
>> suppose you have 2k users, with ~300 active at any instant,
>> and probably 200 unrelated codes running.  while we do require
>> checkpointing (I usually say "every 6-8 cpu hours"), I suspect that 
>> many users never do.  how do you check/validate/encourage/support
>> checkpointing?
>>
> 
> Set your queue maximums to 6-8 hours.  Prevents system hogging,
> encourages checkpointing for long runs.  Make sure your IO system
> can support the checkpointing because it can create a lot of load.

And how do you support my operational requirements with this policy 
during hurricane season?  Let's see... "Stop that ensemble run now so 
the Monte Carlo chemists can play for a while, then we'll let you back 
on.  Don't worry about the timeliness of your simulations.  No one needs 
a 35-member ensemble for statistical forecasting, anyway."  Did I miss 
something?

Yeah, we really do that.  With boundary-condition munging we can run a 
statistical set of simulations and see what the probabilities are and 
where, for instance, maximum storm surge is likely to go.  If we don't 
get sufficient membership in the ensemble, the statistical strength of 
the forecasting procedure decreases.
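
A rough illustration of why membership matters, as a sketch only: it 
assumes the members are independent and equally weighted, and it uses a 
plain binomial standard error, which is a simplification and not our 
operational procedure.  The function names and the surge values in the 
usage note are made up.

```python
import math


def exceedance_probability(peaks, threshold):
    """Fraction of ensemble members whose peak surge exceeds the threshold."""
    return sum(1 for p in peaks if p > threshold) / len(peaks)


def standard_error(p, n):
    """Binomial standard error of a probability p estimated from n members.

    Assumes independent, equally weighted members (an idealization).
    """
    return math.sqrt(p * (1.0 - p) / n)
```

For example, exceedance_probability([1.0, 2.5, 3.2, 0.8, 4.1], 2.0) 
counts three of five members above 2.0.  For a fixed probability, 
standard_error(p, 35) is smaller than standard_error(p, 10), since the 
uncertainty shrinks roughly as one over the square root of the member 
count.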

Gerry

>> part of the reason I got a kick out of this simple backtrace.so
>> is indeed that it's quite possible to conceive of a checkpoint.so
>> which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly decent 
>> job of checkpointing at least serial codes non-intrusively.
>>
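
To make the /proc idea concrete, here is a sketch of the inventory step 
only: listing a process's open file descriptors and its writable memory 
regions on Linux.  It does not save or restore anything, it is not the 
checkpoint.so imagined above, and the function names are hypothetical.

```python
import os


def open_files(pid="self"):
    """Map fd number -> target path by reading the symlinks in /proc/<pid>/fd."""
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        try:
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            pass  # fd was closed between listdir() and readlink()
    return result


def parse_maps(text):
    """Return (start, end, perms, path) for each writable region in maps text."""
    regions = []
    for line in text.splitlines():
        fields = line.split(None, 5)  # path (6th field) may contain spaces
        if len(fields) < 5:
            continue
        perms = fields[1]
        if "w" not in perms:
            continue  # read-only regions can be re-mapped from their files
        start, end = (int(x, 16) for x in fields[0].split("-"))
        path = fields[5].strip() if len(fields) > 5 else ""
        regions.append((start, end, perms, path))
    return regions


def writable_regions(pid="self"):
    """Writable regions of a live process, read from /proc/<pid>/maps."""
    with open(f"/proc/{pid}/maps") as f:
        return parse_maps(f.read())
```

A real checkpointer would also have to record file offsets, copy the 
contents of those regions, and capture register state, none of which 
is attempted here.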
> 
> BTW, I like your code.  I had a script written for me in the past
> (by Greg Lindahl in a galaxy far-far away).  The one modification
> I would make is to print out the MPI ID environment variable (MPI
> flavors vary in how it is set).  Then when it crashes, you know which
> process actually died.
> 
> Craig
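
A sketch of that modification: look through a list of rank variables 
and put the first one found into the crash message.  The variable list 
is an assumption; these are names used by some current MPI launchers 
(Open MPI, MPICH-style PMI, MVAPICH2, PMIx, Slurm), and older or other 
flavors set different ones, so check what your mpirun actually exports. 
The function names are hypothetical.

```python
import os
import socket

# Assumed list; extend it for whatever your MPI flavor exports.
RANK_VARS = (
    "OMPI_COMM_WORLD_RANK",  # Open MPI
    "PMI_RANK",              # MPICH-style PMI launchers
    "MV2_COMM_WORLD_RANK",   # MVAPICH2
    "PMIX_RANK",             # PMIx
    "SLURM_PROCID",          # Slurm srun
)


def find_rank(environ=None):
    """Return (variable name, value) for the first rank variable that is set."""
    environ = os.environ if environ is None else environ
    for name in RANK_VARS:
        value = environ.get(name)
        if value is not None:
            return name, value
    return None, None


def crash_banner(environ=None):
    """One line identifying the dying process, to print before a backtrace."""
    name, value = find_rank(environ)
    where = f"host {socket.gethostname()} pid {os.getpid()}"
    if name is None:
        return f"crash on {where} (MPI rank unknown)"
    return f"crash on {where}, rank {value} (from {name})"
```

Printing the host and pid alongside the rank helps when the rank 
variable is missing, since the process can still be tracked down.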
> 

-- 
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University	
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843