[Beowulf] backtraces

Tue Jun 12 09:19:53 PDT 2007

Craig Tierney wrote:
> Gerry Creager wrote:
>> I've tried to stay out of this.  Really, I have.
>>
>> Craig Tierney wrote:
>>> Mark Hahn wrote:
>>>>> Sorry to start a flame war....
>>>>
>>>> what part do you think was inflamed?
>>>
>>> It was when I was trying to say "Real codes have user-level
>>> checkpointing implemented and no code should ever run for 7
>>> days."
>>
>> A number of my climate simulations will run for 7-10 days to get 
>> century-long simulations to complete.  I've run geodesy simulations 
>> that ran for up to 17 days in the past.  I like to think that my codes 
>> are real enough!
>>
> 
> NCAR and GFDL run climate simulations for weeks as well.  What is the
> longest period of time any one job can run?  It is 8-12 hours.  I can verify
> these numbers if needed, but I can guarantee you that no one is allowed
> to put their job in for 17 days.  With explicit permission they may get
> 24 hours, but that would be for unique situations.

On the p575, we have similar constraints and I do work within those.  In 
my lab, I can control access a bit more and have considerably fewer (and 
truly grateful) users, so if we need to run "forever" we can implement 
that.

>> Real codes do have user-level checkpointing, though.  And even better 
>> codes can be restarted without a lot of user intervention by invoking 
>> a run-time flag and going off for coffee.
>>
> 
> You mean there are people who bother to implement checkpointing and
> then don't write the code like:
> 
> if (checkpoint files exist in my directory) then
>    load checkpoint files
> else
>    start from scratch
> end
> 
> ????

Yes, there are.  No, I'm not one of them.  My stuff does do a restart if 
it stops and finds evidence of a need to continue.  However, I've seen 
this failure time and time again over the years.
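
For what it's worth, the restart-if-checkpoint-exists pattern Craig
describes might be sketched as below.  This is only an illustration, not
anyone's production code: the JSON checkpoint file, the "step"/"value"
state, and the names load_or_init, save_checkpoint and run are all made
up for the example, and steps_this_job merely stands in for a queue's
wall-clock limit.

```python
import json
import os


def load_or_init(path):
    """Load the checkpoint if one exists, otherwise start from scratch."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "value": 0.0}


def save_checkpoint(path, state):
    """Write to a temporary file, then rename it over the old checkpoint,
    so a crash mid-write never leaves a truncated checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(path, total_steps, steps_this_job, checkpoint_every=5):
    """Advance a toy simulation by at most steps_this_job steps.

    steps_this_job is a hypothetical stand-in for the queue limit: the
    job stops there, and the next submission picks up where it left off.
    """
    state = load_or_init(path)
    done = 0
    while state["step"] < total_steps and done < steps_this_job:
        state["value"] += 1.0          # placeholder for the real physics
        state["step"] += 1
        done += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)       # final checkpoint before exiting
    return state
```

Submitting the same job three times with run("ckpt.json", 25, 10) would
take the toy simulation from step 0 to 10, then 20, then 25, with no
user intervention between submissions.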

>>> Set your queue maximums to 6-8 hours.  Prevents system hogging,
>>> encourages checkpointing for long runs.  Make sure your IO system
>>> can support the checkpointing because it can create a lot of load.
>>
>> And how do you support my operational requirements with this policy 
>> during hurricane season?  Let's see... "Stop that ensemble run now so 
>> the Monte Carlo chemists can play for a while, then we'll let you back 
>> on.  Don't worry about the timeliness of your simulations.  No one 
>> needs a 35-member ensemble for statistical forecasting, anyway."  Did 
>> I miss something?
>>
> 
> You kick-off the users that are not running operational codes because
> their work is (probably) not as time constrained.  Also, if you take
> so long to get your answer in an operational mode that the answer 
> doesn't matter anymore, you need a faster computer.  If you cannot
> spit out a 12-hour hurricane forecast in a couple of hours, I would
> be concerned about how valuable the answer would be.

Several points in here.
1.  Preemption is one approach I finally got the admin to buy into for 
forecasting codes.
2.  MY operational codes for an individual simulation don't take long to 
run, except that we don't do a 12-hour hurricane sim but an 84-hour 
sim for the weather side (WRF).  The saving grace here is that the 
nested grids are not too large, so they can run to completion in a couple 
of wall-clock hours.
3.  When one starts trying to twiddle initial conditions statistically 
to create an ensemble, one then has to run all the ensemble members. 
One usually starts with central cases first, especially if one "knows" 
which are central and which are peripheral.  If one run takes 30 min on 
128 processors, and one thinks one needs 57 members run, one exceeds a 
wall-clock day.  And needs a bigger, faster computer, or at least a 
bigger queue reservation.  If one does this without preemption, one gets 
all results back at the end of the hurricane season and declares success 
after 3 years of analysis instead of providing data in near real time.
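
To put numbers on point 3, here is a rough back-of-the-envelope sketch.
It assumes every member takes the same time, members are packed
perfectly onto the available processors, and there is no queue wait
between them; the function name and its parameters are invented for the
illustration.

```python
import math


def ensemble_wallclock_hours(members, minutes_per_member,
                             procs_per_member, procs_available):
    """Estimate wall-clock hours to run a whole ensemble.

    Assumes identical runtimes, perfect packing and zero queue wait,
    so real turnaround will be worse than this.
    """
    if procs_available < procs_per_member:
        raise ValueError("not enough processors for even one member")
    concurrent = procs_available // procs_per_member   # members at once
    waves = math.ceil(members / concurrent)            # back-to-back batches
    return waves * minutes_per_member / 60.0
```

With 57 members at 30 minutes each on 128 processors, run one at a time,
this gives 28.5 hours, which is past a wall-clock day.  Four times the
processors (512) would bring the same estimate down to 7.5 hours.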

Part of this involves the social engineering required on my campus to 
get HPC efforts to work at all...  Alas, none of this has to do with backtraces.

gerry

>> Yeah, we really do that.  With boundary-condition munging we can run a 
>> statistical set of simulations and see what the probabilities are and 
>> where, for instance, maximum storm surge is likely to go.  If we don't 
>> get sufficient membership in the ensemble, the statistical strength of 
>> the forecasting procedure decreases.
>>
>> Gerry
>>
>>>> part of the reason I got a kick out of this simple backtrace.so
>>>> is indeed that it's quite possible to conceive of a checkpoint.so
>>>> which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly decent 
>>>> job of checkpointing at least serial codes non-intrusively.
>>>>
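
As a rough illustration of the discovery step Mark mentions, the sketch
below lists a process's open file descriptors and writable memory
regions from /proc on Linux.  It is not a checkpointer and not Mark's
code: actually saving and restoring the memory is not shown, and the
function names are made up for the example.

```python
import os


def parse_maps_line(line):
    """Parse one line of /proc/<pid>/maps.

    Fields are: address range, permissions, offset, device, inode and an
    optional pathname (which may be empty for anonymous mappings).
    """
    fields = line.split(None, 5)
    addr, perms, offset = fields[0], fields[1], fields[2]
    path = fields[5].strip() if len(fields) > 5 else ""
    start, end = (int(x, 16) for x in addr.split("-"))
    return {"start": start, "end": end, "perms": perms,
            "offset": int(offset, 16), "path": path}


def writable_regions(pid="self"):
    """Return the writable mappings, i.e. the memory a checkpoint
    would have to save."""
    with open(f"/proc/{pid}/maps") as f:
        regions = [parse_maps_line(line) for line in f if line.strip()]
    return [r for r in regions if "w" in r["perms"]]


def open_files(pid="self"):
    """Map each open file descriptor number to what it points at."""
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        try:
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            pass  # descriptor was closed between listdir and readlink
    return result
```

Calling writable_regions() and open_files() with the default "self"
inspects the calling process; passing another pid requires permission
to read that process's /proc entries.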
>>>
>>> BTW, I like your code.  I had a script written for me in the past
>>> (by Greg Lindahl in a galaxy far, far away).  The one modification
>>> I would make is to print out the MPI ID environment variable (MPI
>>> flavors vary in how it is set).  Then when it crashes, you know which
>>> process actually died.
>>>
>>> Craig
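
Craig's suggestion of tagging the crash output with the MPI rank might
look roughly like the sketch below.  It is a Python stand-in, not the
backtrace.so code under discussion.  The variable names checked are
ones set by some current launchers (Open MPI, PMI-based launchers such
as MPICH's, MVAPICH2, and Slurm); the list is an assumption, certainly
incomplete, and other MPI flavors use other names.

```python
import os
import sys

# Rank variables set by some launchers; treat this list as an assumption
# and extend it for whatever MPI flavor is actually in use.
RANK_VARS = (
    "OMPI_COMM_WORLD_RANK",   # Open MPI
    "PMI_RANK",               # PMI-based launchers (e.g. MPICH)
    "MV2_COMM_WORLD_RANK",    # MVAPICH2
    "SLURM_PROCID",           # Slurm srun
)


def mpi_rank_label(environ=os.environ):
    """Return 'NAME=value' for the first rank variable found."""
    for name in RANK_VARS:
        value = environ.get(name)
        if value is not None:
            return f"{name}={value}"
    return "rank unknown"


def report_crash(message, environ=os.environ, stream=sys.stderr):
    """Prefix a crash message with the rank and pid of this process."""
    print(f"[{mpi_rank_label(environ)} pid={os.getpid()}] {message}",
          file=stream)
```

A crash handler would call report_crash() before dumping the backtrace,
so the output shows which process actually died.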

-- 
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843