[Beowulf] backtraces
Craig Tierney
ctierney at hypermall.net
Tue Jun 12 14:48:33 PDT 2007
> Several points in here.
> 1. Preemption is one approach I finally got the admin to buy into for
> forecasting codes.
> 2. My operational codes for an individual simulation don't take long to
> run, save for the fact that we don't do a 12 hr hurricane sim, but an 84
> hour sim for the weather side (WRF). Saving grace here is that the
> nested grids are not too large so they can run to completion in a couple
> of wall-clock hours.
> 3. When one starts trying to twiddle initial conditions statistically
> to create an ensemble, one then has to run all the ensemble members. One
> usually starts with central cases first, especially if one "knows" which
> are central and which are peripheral. If one run takes 30 min on 128
> processors, and one thinks one needs 57 members run, one exceeds a
> wall-clock day. And needs a bigger, faster computer, or at least a
> bigger queue reservation. If one does this without preemption, one gets
> all results back at the end of the hurricane season and declares success
> after 3 years of analysis instead of providing data in near real time.
>
So there are 57 jobs of 30 minutes each. Get your user to rewrite their
scripts so the ensemble isn't submitted as one job. That shouldn't be too hard.
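
As a rough illustration of the arithmetic and of the split into separate jobs, here is a minimal Python sketch. The 57 members and 30 minutes per run are the figures quoted above; the function names, the member labels, and the idea of `concurrent_slots` are assumptions made for illustration. The actual submission command depends on the site's batch system, so it is left out.

```python
import math

MEMBERS = 57            # ensemble size quoted above
MINUTES_PER_RUN = 30    # wall-clock minutes per member on 128 processors


def wall_clock_hours(members, minutes_per_run, concurrent_slots=1):
    """Wall-clock hours if `concurrent_slots` members can run at the same time."""
    waves = math.ceil(members / concurrent_slots)
    return waves * minutes_per_run / 60.0


def member_jobs(members, central_first=()):
    """Return one job label per member, central cases first (labels are hypothetical)."""
    central = list(dict.fromkeys(m for m in central_first if 0 <= m < members))
    chosen = set(central)
    rest = [m for m in range(members) if m not in chosen]
    return [f"member_{m:02d}" for m in central + rest]


if __name__ == "__main__":
    # Back to back: 57 * 30 min = 28.5 h, which is more than a wall-clock day.
    print(wall_clock_hours(MEMBERS, MINUTES_PER_RUN))
    # Four members at a time: 15 waves * 30 min = 7.5 h.
    print(wall_clock_hours(MEMBERS, MINUTES_PER_RUN, concurrent_slots=4))
    # Each label would become its own batch job, so the scheduler can
    # interleave or preempt members individually.
    print(member_jobs(MEMBERS, central_first=(28, 27, 29))[:4])
```

With one job per member, the central cases can finish first and the peripheral ones fill in as slots become free.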
> Part of this involves the social engineering required on my campus to
> get HPC efforts to work at all... Alas, nothing has to do with backtraces.
Very true (on both parts).
Craig
>
> gerry
>
>>> Yeah, we really do that. With boundary-condition munging we can run
>>> a statistical set of simulations and see what the probabilities are
>>> and where, for instance, maximum storm surge is likely to go. If we
>>> don't get sufficient membership in the ensemble, the statistical
>>> strength of the forecasting procedure decreases.
>>>
>>> Gerry
>>>
>>>>> part of the reason I got a kick out of this simple backtrace.so
>>>>> is indeed that it's quite possible to conceive of a checkpoint.so
>>>>> which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly
>>>>> decent job of checkpointing at least serial codes non-intrusively.
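
To make the /proc idea concrete, here is a minimal Python sketch of the inventory step only: it lists the open descriptors and memory regions that a checkpointer would have to save. It assumes a Linux-style /proc, it does not write or restore a checkpoint, and the function names are hypothetical rather than taken from any checkpoint.so.

```python
import os


def parse_maps_line(line):
    """Split one /proc/<pid>/maps line into address range, permissions, and path."""
    fields = line.split(None, 5)
    start, end = (int(x, 16) for x in fields[0].split("-"))
    path = fields[5].strip() if len(fields) > 5 else ""
    return {"start": start, "end": end, "perms": fields[1], "path": path}


def open_files(pid="self"):
    """Map each open descriptor number to its target, read from /proc/<pid>/fd."""
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    try:
        names = os.listdir(fd_dir)
    except OSError:
        return result  # no /proc on this system, or no permission
    for name in names:
        try:
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            pass  # descriptor was closed while we were looking
    return result


def memory_regions(pid="self"):
    """Return the parsed memory map of a process, or an empty list if unreadable."""
    try:
        with open(f"/proc/{pid}/maps") as fh:
            return [parse_maps_line(line) for line in fh if line.strip()]
    except OSError:
        return []


if __name__ == "__main__":
    for fd, target in sorted(open_files().items()):
        print(fd, target)
    writable = [r for r in memory_regions() if "w" in r["perms"]]
    print(len(writable), "writable regions would need saving")
```

Saving the contents of the writable regions and reopening the files at the right offsets is the hard part, and it is not attempted here.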
>>>>>
>>>>
>>>> BTW, I like your code. I had a script written for me in the past
>>>> (by Greg Lindahl in a galaxy far-far away). The one modification
>>>> I would make is to print out the MPI ID environment variable (MPI
>>>> flavors vary in how it is set). Then when it crashes, you know which
>>>> process actually died.
>>>>
>>>> Craig
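
A minimal Python sketch of the rank lookup suggested above, for illustration only. The list of variable names is an assumption and is not exhaustive: which variable is set depends on the MPI flavor and launcher, and older implementations use other names. The function names are hypothetical.

```python
import os

# Variable names differ by MPI flavor and launcher; this list is an
# assumption and is not exhaustive.
RANK_VARIABLES = (
    "OMPI_COMM_WORLD_RANK",  # Open MPI
    "PMI_RANK",              # MPICH and derivatives that use PMI
    "MV2_COMM_WORLD_RANK",   # MVAPICH2
    "SLURM_PROCID",          # processes launched by Slurm's srun
)


def find_mpi_rank(environ=None):
    """Return (variable name, value) for the first rank variable set, else None."""
    environ = os.environ if environ is None else environ
    for name in RANK_VARIABLES:
        value = environ.get(name)
        if value is not None:
            return name, value
    return None


def rank_banner(environ=None):
    """Build a one-line banner to print next to a backtrace."""
    found = find_mpi_rank(environ)
    if found is None:
        return f"pid {os.getpid()}: MPI rank unknown"
    name, value = found
    return f"pid {os.getpid()}: MPI rank {value} (from {name})"


if __name__ == "__main__":
    print(rank_banner())
```

Printing this banner alongside the backtrace ties each crash report to the process that produced it.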
>>>>