[Beowulf] Re: Time limits in queues
Lombard, David N
dnlombar at ichips.intel.com
Thu Jan 17 08:34:19 PST 2008
On Thu, Jan 17, 2008 at 02:53:36PM +0100, Bogdan Costescu wrote:
> On Wed, 16 Jan 2008, Craig Tierney wrote:
>
> >Our queue limits are 8 hours.
> >...
> >Did that sysadmin who set 24 hour time limits ever analyze the amount
> >of lost computational time because of larger time limits?
>
> While I agree with the idea and the reasons behind short job runtime
> limits, I disagree with your formulation. Having been involved many
> times in discussions about what runtime limits should be set, I
> wouldn't make such a statement myself; I would instead say: YMMV. In
> other words: choose what best fits the job mix that your users are
> actually running. If you have determined that an 8h max. runtime is
> appropriate for _your_ cluster, and that increasing it to 24h would
> waste computational time given the reliability of _your_ cluster, then
> you've done your job well. But saying that everybody should use this
> limit is wrong.
Completely agree.
> Furthermore, although you mention that system-level checkpointing is
> associated with a performance hit, you seem to think that user-level
> checkpointing is a lot lighter, which is most often not the case.
Hmmm. A system-level checkpoint must save the complete state of the
process being checkpointed, plus all of its siblings/children, plus varying
amounts of external state; a machine-level checkpoint must save the complete
state of the machine(s). A user-level checkpoint need only save the data that
define the current state--which could well be a small set of values.
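The distinction above can be made concrete with a minimal sketch of
user-level checkpointing, assuming a hypothetical iterative solver whose
entire restart state is just an iteration counter and one value; the file
name, checkpoint interval, and the solver itself are invented for
illustration:

```python
# Sketch of user-level checkpointing: only the small set of values that
# define the current state is saved, not the full process image.
import json
import os

CKPT = "solver.ckpt.json"  # hypothetical checkpoint file name

def load_state():
    # Resume from the last checkpoint if one exists, else start fresh.
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            s = json.load(f)
        return s["iteration"], s["x"]
    return 0, 1.0

def save_state(iteration, x):
    # Write to a temp file and rename, so a kill mid-write never
    # leaves a torn checkpoint behind.
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"iteration": iteration, "x": x}, f)
    os.replace(tmp, CKPT)

def run(max_iters=1000, ckpt_every=100):
    iteration, x = load_state()
    while iteration < max_iters:
        x = 0.5 * (x + 2.0 / x)  # stand-in computation (Newton for sqrt(2))
        iteration += 1
        if iteration % ckpt_every == 0:
            save_state(iteration, x)  # two values: the whole restart state
    return x
```

If the job is killed at the queue's time limit, resubmitting simply picks
up from the last saved iteration; the cost per checkpoint here is a few
bytes of I/O, versus the full address-space dump a system-level
checkpoint would require.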
Having written that, it may be *easier* (even cheaper) to expend the
resources to save the complete state than to restructure some suitably
complex code to expose a restart state. I certainly know an application
that fits that model during most of its runtime. But, at the end of
the day, that is just trading runtime for design/coding/validation
time, and the notion's validity depends on which side of the operation
you sit. Consider this, though: if, as an admin, you rely only on user-
level checkpointing, you *will* end up in an argument with one or more
users about the maximum runtime at some point; with a system (or machine)
checkpoint, you'll likely avoid a lot of agida[1], especially when
unplanned or emergency outages/reprioritizations occur.
> Apart from the obvious I/O limitations that could restrict saving &
> loading of checkpointing data, there are applications for which
> developers have chosen to not store certain data but recompute it
> every time it is needed because the effort of saving, storing &
> loading it is higher than the computational effort of recreating it -
> but this most likely means that for each restart of the application
> this data has to be recomputed. And smaller max. runtimes mean more
> restarts needed to reach the same total runtime...
As you note, only the application can know whether it's easier to recompute
than to save and restore. I suspect many of us can cite specific examples
where it's easier to recompute; some could probably also cite cases
where recomputing is faster, too...
[1] Heartburn, indigestion, general upset or agitation.
--
David N. Lombard, Intel, Irvine, CA
I do not speak for Intel Corporation; all comments are strictly my own.