[Beowulf] backtraces

Craig Tierney ctierney at hypermall.net
Tue Jun 12 14:48:33 PDT 2007

> Several points in here.
> 1.  Preemption is one approach I finally got the admin to buy into for 
> forecasting codes.
> 2.  My operational codes for an individual simulation don't take long to 
> run, except that we don't do a 12-hour hurricane sim but an 84-hour 
> sim for the weather side (WRF).  The saving grace here is that the 
> nested grids are not too large so they can run to completion in a couple 
> of wall-clock hours.
> 3.  When one starts trying to twiddle initial conditions statistically 
> to create an ensemble, one then has to run all the ensemble members. One 
> usually starts with central cases first, especially if one "knows" which 
> are central and which are peripheral.  If one run takes 30 min on 128 
> processors, and one thinks one needs 57 members run, one exceeds a 
> wall-clock day.  And needs a bigger, faster computer, or at least a 
> bigger queue reservation.  If one does this without preemption, one gets 
> all results back at the end of the hurricane season and declares success 
> after 3 years of analysis instead of providing data in near real time.

So there are 57 jobs of 30 minutes each.  Get your user to rewrite their
scripts so each member is submitted as its own job instead of all 57
running inside one job.  That shouldn't be too hard.
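
To put rough numbers on that, here is a minimal sketch in Python. It
assumes the members are independent and that the scheduler can run a few
of them side by side. The launcher name run_member.sh and the figure of
4 concurrent slots are made up for illustration; they are not anything
from Gerry's setup.

```python
import math


def ensemble_plan(members, minutes_per_run, concurrent_slots):
    """Return (serial_hours, waves, wall_hours) for independent ensemble jobs.

    Assumes every member takes the same time and that members run in
    "waves" of at most `concurrent_slots` jobs at once.
    """
    serial_hours = members * minutes_per_run / 60
    waves = math.ceil(members / concurrent_slots)
    wall_hours = waves * minutes_per_run / 60
    return serial_hours, waves, wall_hours


def member_commands(members, launcher="./run_member.sh"):
    """Build one command line per ensemble member.

    `launcher` is a hypothetical per-member script; each command would be
    submitted to the batch system as a separate job.
    """
    return [f"{launcher} {i:02d}" for i in range(1, members + 1)]


if __name__ == "__main__":
    serial, waves, wall = ensemble_plan(57, 30, 4)
    print(f"back to back: {serial} h; {waves} waves of 4: {wall} h")
    for cmd in member_commands(3):
        print(cmd)
```

Run back to back, 57 members at 30 minutes each come to 28.5 hours, which
is the "exceeds a wall-clock day" problem.  As separate jobs the scheduler
can backfill or preempt around them, and the central cases can be
submitted first.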

> Part of this involves the social engineering required on my campus to 
> get HPC efforts to work at all...  Alas, nothing has to do with backtraces.

Very true (on both parts).


> gerry
>>> Yeah, we really do that.  With boundary-condition munging we can run 
>>> a statistical set of simulations and see what the probabilities are 
>>> and where, for instance, maximum storm surge is likely to go.  If we 
>>> don't get sufficient membership in the ensemble, the statistical 
>>> strength of the forecasting procedure decreases.
>>> Gerry
>>>>> part of the reason I got a kick out of this simple backtrace.so
>>>>> is indeed that it's quite possible to conceive of a checkpoint.so
>>>>> which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly 
>>>>> decent job of checkpointing at least serial codes non-intrusively.
>>>> BTW, I like your code.  I had a script written for me in the past
>>>> (by Greg Lindahl in a galaxy far-far away).  The one modification
>>>> I would make is to print out the MPI ID environment variable (MPI
>>>> flavors vary in how it is set).  Then when it crashes, you know which
>>>> process actually died.
>>>> Craig
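
For the rank-printing suggestion quoted above, a rough sketch of the
lookup in Python.  The variable names are the ones I believe common
launchers set (Open MPI, MPICH-style PMI, MVAPICH2, Slurm), but treat the
list as an assumption and check it against your own MPI flavor and
version.  The function names are made up for illustration and this is
not the backtrace.so code itself.

```python
import os

# Candidate rank variables, checked in order. This list is an assumption:
# which one is set depends on the MPI implementation and launcher.
RANK_VARS = (
    "OMPI_COMM_WORLD_RANK",  # Open MPI
    "PMI_RANK",              # MPICH-style process managers
    "MV2_COMM_WORLD_RANK",   # MVAPICH2
    "SLURM_PROCID",          # tasks launched by Slurm's srun
)


def mpi_rank(environ=None):
    """Return (variable_name, value) for the first rank variable found.

    Returns (None, "unknown") if none of the candidates is set.
    """
    env = os.environ if environ is None else environ
    for name in RANK_VARS:
        value = env.get(name)
        if value is not None:
            return name, value
    return None, "unknown"


def crash_header(environ=None):
    """Format a one-line header to print ahead of a backtrace."""
    name, rank = mpi_rank(environ)
    return f"pid {os.getpid()} rank {rank} (from {name})"


if __name__ == "__main__":
    print(crash_header())
```

Printing that line before the backtrace lets you tell which process of
the job actually died.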
