[Beowulf] Execution time measurements (fwd)
Mark Hahn
hahn at mcmaster.ca
Mon Mar 7 21:26:26 PST 2011
Hi list,
I'm forwarding a message (I've taken the liberty of inserting some of my own
comments...)
Mikhail Kuzminsky writes:
> I still can't post directly to the Beowulf mailing list (although I receive
> all the list mail). Could you please forward my question to
> beowulf at beowulf.org?
>
> Thanks for your help !
> Mikhail
>
> Execution time measurements
>
> I obtained strange results for Gaussian-09 (G09) execution times in a
> sequential (!) run on a dual-socket Opteron 2350-based node under openSUSE
> 10.3 (kernel 2.6.22). The measurements were performed for a "pure CPU-bound"
> job (a frequency calculation at the DFT level), for which the start-to-stop
> (wall-clock) time is practically equal to the CPU time (the difference is
> less than 1 minute per day of execution). G09 itself prints both the CPU
> time and the start and stop dates/times.
>
> There is a job whose execution time is about 1 day, but two *DIFFERENT*
> execution times were actually measured: the first about 1 day (24 h), the
> second about 1 day 3 hours (27 h). Both results were reproduced a few times
> and give the same quantum-chemical results; only the execution time differs.
> No other applications were running during these measurements. The time
> difference is not localized in any particular part (link) of G09.
OK, if I understand: same tests, different timings.
> "Fast" execution was reproduced 2 times: one - as usual sequential run and
>on - in simultaneous execution of 2 examples of this job (therefore there is
>no mutual influence of this jobs). These runs was not managed manually in
>the sense of numa allocation.
>
> "Slow" execution was reproduced minimum 5 times. The memory required for
>job execution is 4+ GB, and it's possible to allocate the necessary RAM from
>one node only (we have 8 GB per node, 16 GB per server). I forced (using
>numactl) to use both cpu (core) and memory from node 1, but it gives "slow"
>execution time. When I forced cpu 1 and memory 0, execution time was
>increased to 1h (up to 28h).
if I understand, the test runs slower when the process is remote from its
memory. that's not at all surprising, right?
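as an aside, the local/remote penalty is easy to see directly with libnuma.
here's a rough sketch of mine (nothing to do with g09; the buffer size,
stride, and node numbers are just illustrative assumptions for a two-node
box like yours):

    /* local_vs_remote.c - toy local-vs-remote memory access timing.
     * build:  gcc -O2 local_vs_remote.c -lnuma -lrt -o local_vs_remote
     */
    #define _GNU_SOURCE
    #include <numa.h>
    #include <stdio.h>
    #include <time.h>

    static double walk(volatile char *buf, size_t len)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < len; i += 64)   /* touch one byte per cache line */
            buf[i]++;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    }

    int main(void)
    {
        const size_t len = 256UL << 20;        /* 256 MB, an arbitrary size */
        if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }

        numa_run_on_node(0);                   /* keep this process on node 0's cpus */

        char *local  = numa_alloc_onnode(len, 0);   /* memory on the same node  */
        char *remote = numa_alloc_onnode(len, 1);   /* memory on the other node */
        if (!local || !remote) { fprintf(stderr, "allocation failed\n"); return 1; }

        printf("local : %.3f s\n", walk(local,  len));
        printf("remote: %.3f s\n", walk(remote, len));

        numa_free(local,  len);
        numa_free(remote, len);
        return 0;
    }

the remote walk should come out measurably slower; that ratio roughly bounds
how much of a real app's slowdown you can blame on placement alone.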
> (BTW, G09 links are threads. Does the numactl binding for the main module
> "extend" to child threads as well?)
numactl uses sched_setaffinity et al. to set per-process properties that
_are_ inherited. so unless g09 manipulates the numa settings itself, the
"outer" numactl settings will control threads and links as well.
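to illustrate (a toy example of mine, not anything from g09 or numactl
itself): set the mask on the process, and a thread created afterwards
reports the same mask:

    /* affinity_inherit.c - cpu masks set on the process are inherited by
     * threads created afterwards.
     * build:  gcc -O2 -pthread affinity_inherit.c -o affinity_inherit
     */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        sched_getaffinity(0, sizeof(mask), &mask);   /* 0 = the calling thread */
        printf("thread allowed on cpu 0? %s\n", CPU_ISSET(0, &mask) ? "yes" : "no");
        printf("thread allowed on cpu 1? %s\n", CPU_ISSET(1, &mask) ? "yes" : "no");
        return NULL;
    }

    int main(void)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(0, &mask);                           /* restrict to cpu 0 only */
        sched_setaffinity(0, sizeof(mask), &mask);   /* roughly what numactl does */

        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);      /* inherits the mask above */
        pthread_join(t, NULL);
        return 0;
    }

the memory policy numactl sets (via set_mempolicy/mbind) is inherited the
same way, which is why the outer binding should cover g09's links too.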
> Then I checked G09's own timings by running it under the time command. The
> G09 and time results were the same, but I only checked the "slow" execution
> time.
I haven't seen inaccuracies in g09's time-reporting.
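for what it's worth, that cross-check is cheap to do in a standalone way
too: compare the wall clock against getrusage, which is roughly what time(1)
reports. a little sketch (the busy-loop is just arbitrary work):

    /* timing_check.c - compare wall-clock time against user+system cpu time.
     * build:  gcc -O2 timing_check.c -lrt -o timing_check
     */
    #include <stdio.h>
    #include <sys/resource.h>
    #include <sys/time.h>
    #include <time.h>

    int main(void)
    {
        struct timespec w0, w1;
        struct rusage ru;
        volatile double x = 0.0;

        clock_gettime(CLOCK_MONOTONIC, &w0);
        for (long i = 1; i < 200000000L; i++)    /* arbitrary cpu-bound work */
            x += 1.0 / (double)i;
        clock_gettime(CLOCK_MONOTONIC, &w1);

        getrusage(RUSAGE_SELF, &ru);
        double wall = (w1.tv_sec - w0.tv_sec) + 1e-9 * (w1.tv_nsec - w0.tv_nsec);
        double cpu  = ru.ru_utime.tv_sec + 1e-6 * ru.ru_utime.tv_usec
                    + ru.ru_stime.tv_sec + 1e-6 * ru.ru_stime.tv_usec;

        printf("wall: %.2f s   cpu: %.2f s\n", wall, cpu);
        return 0;
    }

for a cpu-bound job like yours, wall and cpu should agree closely; a large
gap would point at waiting (i/o, paging, contention) rather than slow cpu
work.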
> The CPU frequency was fixed (no cpufreq kernel module was loaded).
>
> I would greatly appreciate any ideas about the possible reason for the two
> different execution times observed.
>
> Mikhail Kuzminsky
> Computer Assistance to Chemical Research Center
> Zelinsky Institute of Organic Chemistry RAS
> Moscow, Russia
I think you already have a grip on the answer: any program runs fastest
when it's running with memory "local" to the CPU it's using. remote pages,
after all, are effectively slower (higher latency, lower bandwidth). if the
app doesn't control this (or you don't, with numactl), then you should
expect performance to lie somewhere between the two extremes (fully local vs
fully remote). the kernel does make some effort at keeping things local -
and, for that matter, at avoiding moving a process among multiple
cores/sockets.
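and if an app wants to handle this itself, libnuma makes it just a couple of
calls - roughly what numactl --cpunodebind=0 --membind=0 does from the
outside. a sketch, assuming libnuma's v2 bitmask api, with the node number
only as an example:

    /* bind_node0.c - keep this process and its allocations on node 0.
     * build:  gcc -O2 bind_node0.c -lnuma -o bind_node0
     */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this kernel\n");
            return 1;
        }

        struct bitmask *nodes = numa_allocate_nodemask();
        numa_bitmask_setbit(nodes, 0);   /* node 0 only */

        numa_run_on_node(0);             /* cpus of node 0 (like --cpunodebind=0) */
        numa_set_membind(nodes);         /* allocate from node 0 (like --membind=0) */

        /* ... real work goes here; allocations now come from node 0, and
         * the policy is inherited by threads/children started after this. */

        numa_free_nodemask(nodes);
        return 0;
    }

whether that's worth building into an app, rather than leaving it to numactl
or the batch system, is a separate question.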
how much this matters depends on the app. anything cache-friendly won't
care - GUIs and a lot of servers would prefer to run sooner, rather than
insisting on a particular cpu or bank of memory...
I'll confess something here: my organization hasn't bothered with any
special affinity settings until recently. I'm not sure how often HPC centers
do worry about this. obviously, such settings complicate the scheduler's
job and probably require more tuning by the user...
regards, mark hahn.