[Beowulf] Re: Time limits in queues
Lombard, David N
dnlombar at ichips.intel.com
Thu Jan 17 08:34:19 PST 2008
On Thu, Jan 17, 2008 at 02:53:36PM +0100, Bogdan Costescu wrote:
> On Wed, 16 Jan 2008, Craig Tierney wrote:
>
> >Our queue limits are 8 hours.
> >...
> >Did that sysadmin who set 24 hour time limits ever analyze the amount
> >of lost computational time because of larger time limits?
>
> While I agree with the idea and the reasons behind short job runtime
> limits, I disagree with your formulation. Having been involved many
> times in discussions about what runtime limits should be set, I
> wouldn't make such a statement myself; I would instead say: YMMV. In
> other words: choose what best fits the job mix that your users are
> actually running. If you have determined that an 8h max. runtime is
> appropriate for _your_ cluster, and that increasing it to 24h would
> waste computational time given the reliability of _your_ cluster, then
> you've done your job well. But saying that everybody should use this
> limit is wrong.
Completely agree.
> Furthermore, although you mention that system-level checkpointing is
> associated with a performance hit, you seem to think that user-level
> checkpointing is a lot lighter, which is most often not the case.
Hmmm. A system-level checkpoint must save the complete state of the
process being checkpointed, plus all of its siblings/children, plus varying
amounts of external state; a machine-level checkpoint must save the complete
state of the machine(s). A user-level checkpoint need only save the data that
define the current state--which could well be a small set of values.
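The distinction above can be made concrete with a minimal sketch of
user-level checkpointing, assuming a hypothetical iterative solver whose
entire restart state is just an iteration counter and one value; the file
name, checkpoint interval, and the solver itself are invented for
illustration:

```python
# Sketch of user-level checkpointing: only the small set of values that
# define the current state is saved, not the full process image.
import json
import os

CKPT = "solver.ckpt.json"  # hypothetical checkpoint file name

def load_state():
    # Resume from the last checkpoint if one exists, else start fresh.
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            s = json.load(f)
        return s["iteration"], s["x"]
    return 0, 1.0

def save_state(iteration, x):
    # Write to a temp file and rename, so a kill mid-write never
    # leaves a torn checkpoint behind.
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"iteration": iteration, "x": x}, f)
    os.replace(tmp, CKPT)

def run(max_iters=1000, ckpt_every=100):
    iteration, x = load_state()
    while iteration < max_iters:
        x = 0.5 * (x + 2.0 / x)  # stand-in computation (Newton for sqrt(2))
        iteration += 1
        if iteration % ckpt_every == 0:
            save_state(iteration, x)  # two values: the whole restart state
    return x
```

If the job is killed at the queue's time limit, resubmitting simply picks
up from the last saved iteration; the cost per checkpoint here is a few
bytes of I/O, versus the full address-space dump a system-level
checkpoint would require.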
Having written that, it may be *easier* (even cheaper) to expend the
resources to save the complete state than to restructure some suitably
complex code to expose a restart state. I certainly know an application
that fits that model during most of its runtime. But, at the end of
the day, that is just trading runtime for design/coding/validation
time, and the notion's validity depends on which side of the operation
you sit. Consider this, though: if, as an admin, you rely only on user-
level checkpointing, you *will* end up in an argument with one or more
users about the maximum runtime at some point; with a system (or machine)
checkpoint, you'll likely avoid a lot of agida[1], especially when
unplanned or emergency outages/reprioritizations occur.
> Apart from the obvious I/O limitations that could restrict saving &
> loading of checkpointing data, there are applications for which
> developers have chosen to not store certain data but recompute it
> every time it is needed because the effort of saving, storing &
> loading it is higher than the computational effort of recreating it -
> but this most likely means that for each restart of the application
> this data has to be recomputed. And smaller max. runtimes mean more
> restarts needed to reach the same total runtime...
As you note, only the application can know whether it's easier to recompute
than to save and restore. I suspect many of us can cite specific examples
where it's easier to recompute; some could probably also cite cases
where recomputing is faster, too...
[1] Heartburn, indigestion, general upset or agitation.
--
David N. Lombard, Intel, Irvine, CA
I do not speak for Intel Corporation; all comments are strictly my own.