[Beowulf] Time measurement

Fri Aug 5 05:02:09 PDT 2005

Vincent Diepeveen writes:

> Always measure wall clock time of execution. That's IMHO the only thing
> that really counts and it includes all overhead.
> 
> Take into account that on most clusters the wall clock time of node A is
> not the same as the wall clock time of node B.
> 
> If you start a job at the head node and spawn it from there further to
> the rest of the cluster, what counts obviously is the time from when you
> started it, until it finished execution.
> 
> Because in reality what matters is how quickly you got the job done.

Unless you are trying to time how fast the computer does actual
instructions in a particular context.

Remember, there are lots of reasons to time things.  One is certainly to
time how fast you get a "job" done, where a job is a complex entity with
all sorts of overhead and inefficiencies.  However, there are others,
such as wanting to know how fast a computer can generate random numbers
with a particular algorithm inside a generic loop WITHOUT having the
results affected by the fact that your computer at the time of the test
was running a cron job or dealing with a broadcast storm generated by
some ill-managed system down the wire.  Or how fast it can do a simple
divide operation in a given context, again doing one's very best NOT to
include random delays introduced by an interrupt and/or context switch
that happens to occur right in the middle of the timing interval.  In
these microbenchmark contexts it is actually a PITA to "prepare" the
system in such a way as to make interruptions like this unlikely in the
timing interval(s), which is why writing a "good" microbenchmark program
is nontrivial.

Even when timing jobs one has to remember that BECAUSE of the
uncertainty in the "state" of the computer during the timing interval,
your recorded times may or may not be terribly accurate predictors of
actual performance in a different global context or state.  It is
important to do a NUMBER of measurements if possible, ideally with some
degree of knowledge and control of system state, and at least eyeball
the statistics of the results.  
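
As a rough illustration of taking a NUMBER of measurements and eyeballing
their statistics, here is a minimal sketch in C.  It assumes a POSIX
system with clock_gettime(); the names now_seconds(), time_repeated(),
summarize(), struct scale_job and scale_array() are hypothetical helpers
made up for this example, and the array-scaling workload is only a
stand-in for whatever you actually want to time.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Summary of a set of timing samples, in seconds. */
struct timing_stats {
    double min, max, mean, variance;
};

/* Wall-clock seconds from a monotonic clock (POSIX clock_gettime). */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Time `reps` calls of work(arg), storing one wall-clock sample per call. */
static void time_repeated(void (*work)(void *), void *arg,
                          double *samples, size_t reps)
{
    assert(work != NULL && samples != NULL);
    for (size_t r = 0; r < reps; r++) {
        double t0 = now_seconds();
        work(arg);
        samples[r] = now_seconds() - t0;
    }
}

/* Min, max, mean and sample variance (n-1 denominator) of the samples. */
static struct timing_stats summarize(const double *samples, size_t n)
{
    struct timing_stats s = {0.0, 0.0, 0.0, 0.0};
    if (n == 0)
        return s;
    double sum = 0.0;
    s.min = s.max = samples[0];
    for (size_t i = 0; i < n; i++) {
        if (samples[i] < s.min) s.min = samples[i];
        if (samples[i] > s.max) s.max = samples[i];
        sum += samples[i];
    }
    s.mean = sum / (double)n;
    if (n > 1) {
        double ss = 0.0;
        for (size_t i = 0; i < n; i++) {
            double d = samples[i] - s.mean;
            ss += d * d;
        }
        s.variance = ss / (double)(n - 1);
    }
    return s;
}

/* Hypothetical workload for illustration: scale one array into another. */
struct scale_job {
    double *a;
    const double *b;
    size_t n;
};

static void scale_array(void *arg)
{
    struct scale_job *job = arg;
    for (size_t i = 0; i < job->n; i++)
        job->a[i] = 3.14159 * job->b[i];
}
```

A caller would fill in a struct scale_job, call time_repeated() with
scale_array and a samples buffer, and then pass the buffer to
summarize().  A max that sits far above the mean is a hint that some
samples were hit by interrupts, context switches or other system noise.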

Doing this has revealed lots of interesting things (on this list, even)
over the years, such as huge delays "randomly" inserted in TCP streams,
anomalies in the rate at which processors perform particular
instructions, and anomalies in the performance of all sorts of network
adapters (some of which work(ed) fine for short traffic bursts but
collapsed on the floor screaming if fed a system-saturating stream of
small packets).

As an example of the instruction-rate anomalies, multiplying by a power
of two in C code is often a TERRIBLE predictor of how fast a processor
multiplies in general, even in an instruction form such as

  a[i] = 2.0*b[i];

because modern processors optimize such operations and perform them much
faster than

  a[i] = 3.14159*b[i];

Generating an actual histogram of timings is good.  Look for outliers
(if any) -- these are an indication that something highly nonlinear and
state dependent is going on in your code (or on your system), and they
are often a place to focus optimization energy.
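
To make the histogram suggestion concrete, here is a small sketch in C
that bins a set of timing samples and prints a crude text histogram.
The function names build_histogram() and print_histogram() and the
choice of fixed-width bins between caller-supplied limits are
assumptions made for this example, not anything prescribed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Count samples into `nbins` equal-width bins covering [lo, hi).
 * Samples outside the range are clamped into the first or last bin,
 * so outliers stay visible instead of being silently dropped.
 */
static void build_histogram(const double *samples, size_t n,
                            double lo, double hi,
                            size_t *bins, size_t nbins)
{
    assert(samples != NULL && bins != NULL);
    assert(nbins > 0 && hi > lo);
    for (size_t k = 0; k < nbins; k++)
        bins[k] = 0;
    for (size_t i = 0; i < n; i++) {
        double pos = (samples[i] - lo) / (hi - lo) * (double)nbins;
        size_t k;
        if (!(pos >= 0.0))              /* below range (or NaN) */
            k = 0;
        else if (pos >= (double)nbins)  /* at or above the upper limit */
            k = nbins - 1;
        else
            k = (size_t)pos;
        bins[k]++;
    }
}

/* Print one line per bin: lower edge, a bar of '#', and the count. */
static void print_histogram(const size_t *bins, size_t nbins,
                            double lo, double hi)
{
    double width = (hi - lo) / (double)nbins;
    for (size_t k = 0; k < nbins; k++) {
        printf("%12.3e | ", lo + (double)k * width);
        for (size_t c = 0; c < bins[k]; c++)
            putchar('#');
        printf(" (%zu)\n", bins[k]);
    }
}
```

With timings in seconds, one would typically pick lo and hi from the
observed min and max.  A long, thin tail or a second isolated clump in
the printed bars is the kind of outlier worth chasing.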

With all that said, yes, using wall time is a very good thing to do,
ideally measured with the system CPU timer itself.  gettimeofday() in
the past has had a call resolution of about 2 usec (2000 nanoseconds)
depending on how it is implemented (I think it is moving or has moved
towards being implemented on top of the CPU timer where possible).  The
CPU timer can yield a time resolution of 40-70 nanoseconds per timing
call pair, which of course can be improved with good statistics and/or
inlining the timing assembler and avoiding subroutine calls.
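
One way to see what a timing call pair costs on a given system is to
measure it directly.  The sketch below, in C, averages the cost of many
back-to-back call pairs.  It assumes a POSIX system; clock_gettime() with
CLOCK_MONOTONIC is used here as a portable stand-in and is NOT the raw
CPU cycle counter discussed above, so the numbers it reports need not
match the figures quoted.  The function names are hypothetical.

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/* Average cost, in nanoseconds, of one back-to-back gettimeofday() pair. */
static double gettimeofday_pair_ns(size_t pairs)
{
    struct timeval first, t0, t1;
    if (pairs == 0)
        return 0.0;
    gettimeofday(&first, NULL);
    t1 = first;
    for (size_t i = 0; i < pairs; i++) {
        gettimeofday(&t0, NULL);
        gettimeofday(&t1, NULL);
    }
    double elapsed_us = (double)(t1.tv_sec - first.tv_sec) * 1e6
                      + (double)(t1.tv_usec - first.tv_usec);
    return 1e3 * elapsed_us / (double)pairs;
}

/* Same measurement for clock_gettime(CLOCK_MONOTONIC). */
static double clock_gettime_pair_ns(size_t pairs)
{
    struct timespec first, t0, t1;
    if (pairs == 0)
        return 0.0;
    clock_gettime(CLOCK_MONOTONIC, &first);
    t1 = first;
    for (size_t i = 0; i < pairs; i++) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        clock_gettime(CLOCK_MONOTONIC, &t1);
    }
    double elapsed_ns = (double)(t1.tv_sec - first.tv_sec) * 1e9
                      + (double)(t1.tv_nsec - first.tv_nsec);
    return elapsed_ns / (double)pairs;
}
```

Calling each function with, say, a million pairs and printing the
results gives a rough per-pair overhead for that machine and kernel.
Anything you time should run for much longer than this overhead, or be
averaged over many repetitions.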

A final good thing to do is to remember profiling.  Even "jobs" --
perhaps especially "jobs" -- benefit from profiling.  The times won't be
terribly accurate because of job instrumentation and so on, but getting
a good idea of where your job is spending most of its time can be a
surprising and rewarding thing to do.  Surprising because it might not
be where you think it is; rewarding because once you know where it is
doing a lot of work you may be able to rearrange it for improved
performance.

    rgb

> 
> Vincent
> 
> At 08:13 AM 8/1/2005 -0700, ThanhVu  H. Nguyen - Gmail wrote:
>>Hi, just wondering what the standard way of measuring the execution time
>>is?
>>
>>2 methods I thought about are: 
>>
>>1) /usr/bin/time prog : this includes all the communication, i/o,
>>loading overhead etc.
>>
>>2) include start_time, end_time code in the program : this won't
>>include the communication, i/o, loading etc. overhead.
>>
>>what method is usually used ?  thanks  
>>
>>tvn,
>>
>>ThanhVu H. Nguyen
>>_______________________________________________
>>Beowulf mailing list, Beowulf at beowulf.org
>>To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf