[Beowulf] backtraces

Craig Tierney ctierney at hypermall.net
Tue Jun 12 14:48:33 PDT 2007


> Several points in here.
> 1.  Preemption is one approach I finally got the admin to buy into for 
> forecasting codes.
> 2.  My operational codes for an individual simulation don't take long to 
> run, save for the fact that we don't do a 12 hr hurricane sim, but an 84 
> hour sim for the weather side (WRF).  The saving grace here is that the 
> nested grids are not too large, so they can run to completion in a couple 
> of wall-clock hours.
> 3.  When one starts trying to twiddle initial conditions statistically 
> to create an ensemble, one then has to run all the ensemble members. One 
> usually starts with central cases first, especially if one "knows" which 
> are central and which are peripheral.  If one run takes 30 min on 128 
> processors, and one thinks one needs 57 members run, one exceeds a 
> wall-clock day.  And needs a bigger, faster computer, or at least a 
> bigger queue reservation.  If one does this without preemption, one gets 
> all results back at the end of the hurricane season and declares success 
> after 3 years of analysis instead of providing data in near real time.
> 

So there are 57 independent jobs of 30 minutes each.  Get your user to
rewrite their scripts so the ensemble isn't submitted as one job and the
scheduler can fit the members in individually.  That shouldn't be too hard.
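
To put numbers on that, here is a minimal Python sketch.  It assumes the
ensemble members are independent of each other and that the scheduler can
run some number of them side by side.  The function names, the job naming
scheme and the concurrency figure are illustrative assumptions, not
anything from the original scripts, and no real scheduler commands are
issued.

```python
import math


def ensemble_wallclock_minutes(members, minutes_per_run, concurrent_jobs):
    """Estimate wall-clock time when members run in waves of concurrent_jobs."""
    if concurrent_jobs < 1:
        raise ValueError("concurrent_jobs must be at least 1")
    # Each wave runs up to concurrent_jobs members at once.
    waves = math.ceil(members / concurrent_jobs)
    return waves * minutes_per_run


def member_jobs(members, procs_per_run, minutes_per_run):
    """Describe one small job per ensemble member (names are hypothetical)."""
    return [
        {
            "name": f"member_{i:02d}",
            "procs": procs_per_run,
            "walltime_minutes": minutes_per_run,
        }
        for i in range(1, members + 1)
    ]


# Figures from the quoted message: 57 members, 30 minutes each, 128 processors.
serial = ensemble_wallclock_minutes(57, 30, 1)    # one member at a time
parallel = ensemble_wallclock_minutes(57, 30, 4)  # assumed: 4 members at once
jobs = member_jobs(57, 128, 30)

print(f"serial: {serial / 60:.1f} h, 4 at a time: {parallel / 60:.1f} h")
print(f"{len(jobs)} jobs, first is {jobs[0]['name']}")
```

Run one after another, the 57 members need 28.5 hours, which is the
"exceeds a wall-clock day" case.  As separate 30-minute jobs they can be
scheduled (or preempted) individually, and every extra member the queue
can run concurrently cuts the total.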

> Part of this involves the social engineering required on my campus to 
> get HPC efforts to work at all...  Alas, nothing has to do with backtraces.

Very true (on both parts).

Craig



> 
> gerry
> 
>>> Yeah, we really do that.  With boundary-condition munging we can run 
>>> a statistical set of simulations and see what the probabilities are 
>>> and where, for instance, maximum storm surge is likely to go.  If we 
>>> don't get sufficient membership in the ensemble, the statistical 
>>> strength of the forecasting procedure decreases.
>>>
>>> Gerry
>>>
>>>>> part of the reason I got a kick out of this simple backtrace.so
>>>>> is indeed that it's quite possible to conceive of a checkpoint.so
>>>>> which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly 
>>>>> decent job of checkpointing at least serial codes non-intrusively.
>>>>>
>>>>
>>>> BTW, I like your code.  I had a script written for me in the past
>>>> (by Greg Lindahl in a galaxy far-far away).  The one modification
>>>> I would make is to print out the MPI ID environment variable (MPI
>>>> flavors vary in how it is set).  Then when it crashes, you know which
>>>> process actually died.
>>>>
>>>> Craig
>>>>
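
The rank lookup suggested above could look roughly like the Python sketch
below.  The variable names are ones set by some present-day MPI
implementations and launchers; the list is an assumption and is not
exhaustive, so check what your own MPI flavor exports.  The helper names
are hypothetical and this is not the backtrace.so code discussed in the
thread.

```python
import os

# Rank variables set by some MPI implementations and launchers.
# Assumption: this list is illustrative, not exhaustive.
RANK_ENV_VARS = (
    "OMPI_COMM_WORLD_RANK",  # Open MPI
    "PMI_RANK",              # MPICH-style launchers
    "PMIX_RANK",             # PMIx-based launchers
    "MV2_COMM_WORLD_RANK",   # MVAPICH2
    "SLURM_PROCID",          # Slurm srun
)


def find_mpi_rank(environ=None):
    """Return (variable_name, value) for the first rank variable found, else None."""
    env = os.environ if environ is None else environ
    for name in RANK_ENV_VARS:
        value = env.get(name)
        if value is not None:
            return name, value
    return None


def describe_process(environ=None):
    """Build a one-line rank description to print alongside a backtrace."""
    found = find_mpi_rank(environ)
    if found is None:
        return "MPI rank: unknown (no known rank variable set)"
    name, value = found
    return f"MPI rank: {value} (from {name})"


print(describe_process())
```

Printing this line next to the backtrace tells you which process died
without having to know in advance which MPI flavor launched the job.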