[Beowulf] Opteron performance

Joe Landman landman at scalableinformatics.com
Sat Nov 27 16:30:41 PST 2004



Kozin, I (Igor) wrote:

>Regarding Opteron performance and performance in general.
>
>To the Opteron users out there, would you mind sharing your
>experience to date regarding Opteron performance?
>
>It would be particularly interesting to hear 
>- what the problems are (if any), 
>  
>

NUMA support is maturing.  Processor/memory affinity, memory layout, and 
other issues can impact benchmarks dramatically.  You want your thread 
running on a CPU adjacent to the memory holding the application's data.  
You also want to make sure you set up your memory system properly; I 
have seen too many benchmarks run on systems with poorly configured 
memory.

>- kernel comparisons preferably in quantifiable terms (e.g. 
>  RH vs Suse vs GNU 2.6, NUMA and O(1) scheduler support etc),
>  
>

Don't use a 2.4 kernel if you can avoid it.  The RH kernels (even with 
backports) are ancient, and lots of good things are missing (IMO).  SuSE 
ships and supports 2.6, as do some others.  2.6 knows more about NUMA.

>- the real benefit of HyperTransport etc.
>I'd suggest leaving compiler comparisons out.
>
>What bothers me primarily is that you have to run a benchmark 
>many more times than usual to get the best performance on an Opteron.
>I've heard other people also mentioning "strange" Opteron behaviour.
>I'd actually suggest reporting a standard deviation
>with every performance figure. I've seen as much as 50% deviation 
>in performance from run to run. 
>  
>

Reporting an SD is generally a wise thing (it is a measurement, and you 
expect some width to it).  There are far too many "benchmarks" out there 
where people run their tests once, get a number, and have no sense of 
how repeatable the measurement is.  They are happy to draw conclusions 
from it all the same.
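
To put numbers on that, here is a quick sketch using Python's standard 
`statistics` module, fed with the six serial times Igor quotes below:

```python
import statistics

# The six serial CPMD run times (seconds) quoted below.  Note the
# apparent bimodal split: three runs near 3250 s, three near 2600 s.
times = [3254, 2579, 3258, 2582, 3258, 2658]

mean = statistics.mean(times)   # 2931.5 s
sd = statistics.stdev(times)    # ~357 s, roughly 12% of the mean

print(f"{mean:.1f} +/- {sd:.1f} s")
```

Quoting a single "best" time hides that ~12% spread entirely.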

>Here is one sequence of serial performance, for example 
>(time in seconds, Opteron 244, SuSE with a 2.4.19 kernel): 
>3254, 2579, 3258, 2582, 3258, 2658. Clearly, if I state that the 
>Opteron can do the job in 2579 sec, it makes little practical sense 
>because in 50% of cases it will run 26% slower. 
>I tested the same executable with a GNU 2.6.8 kernel and 
>was pleased to observe a much smaller deviation 
>(about half as large, though I could not quite get the same 
>best performance as before: 2632 vs 2579 previously).
>  
>

If you look carefully at the numbers, they are not uniformly 
distributed.  You appear to have a bimodal distribution, which is 
consistent with processor affinity problems on a dual-processor 
machine.  I would guess that the higher numbers represent the cases 
where the memory for CPMD sat on processor 0's node while the code 
itself ran on processor 1.  That means an HT hop to get at the data.  
You can force affinity with the affinity scheduler tools (Robert Love's 
schedutils, http://tech9.net/rml/schedutils/).
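
For illustration, a minimal present-day equivalent of those tools 
(Linux-only; `os.sched_setaffinity` is a modern stand-in for what 
taskset/schedutils did at the time):

```python
import os

# Pin the current process to CPU 0 (Linux-only).  On a dual Opteron this
# stops the scheduler from migrating the process to the other socket,
# away from the node that holds its memory, so every run sees the same
# local memory latency instead of sometimes paying an HT hop.
os.sched_setaffinity(0, {0})

# Confirm the mask took effect.
print(os.sched_getaffinity(0))   # -> {0}
```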

>I must admit that the application is memory bandwidth hungry
>and generally the more the application is memory intensive
>the higher the deviation. In the case above some of the memory
>had to be used off the second cpu. 
>  
>

The ancient kernels (a la the RH 2.4 series) all demonstrated this sort 
of problem for me as well.  The more modern kernels do much better.

>Also, in a parallel application the kernel issues are more likely 
>to be blurred out because of the averaging across many CPUs.
>It is reasonable to imagine that, had the Opteron a better kernel,
>the performance could have been better too.
>  
>

Use 2.6.  You will not be sorry.

>The latest disappointment was to observe CPMD performance
>on a quad Opteron 848 (2.2 GHz) vs a dual Opteron 246 (2 GHz):
>(CPMD 3.9.1, wat32 benchmark, run 1/ run 2)
>serial performance: 1963s/3395s vs 2204s/3466s - alright, 
>   the Opteron 848 is quicker.
>parallel: 223s/378s (4x4) vs 218s/360s (8x2) - the Opteron 246 is
>   quicker.
>How much should I expect from HT anyway? If it is not again
>about proper kernel support then I'd be better off having
>dual Opterons if I need to run CPMD.
>  
>
There is more opportunity for contention in a quad than in a dual if 
the scheduler does not know about NUMA or does not take processor 
affinity into account when scheduling.
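
A back-of-the-envelope check on the figures quoted above (run-1 times; 
both configurations use 16 processes):

```python
# Run-1 wat32 times (seconds) quoted above.
serial_848, parallel_848 = 1963, 223   # quad Opteron 848: 4 nodes x 4 cpus
serial_246, parallel_246 = 2204, 218   # dual Opteron 246: 8 nodes x 2 cpus

speedup_848 = serial_848 / parallel_848   # ~8.8x on 16 cpus (~55% efficiency)
speedup_246 = serial_246 / parallel_246   # ~10.1x on 16 cpus (~63% efficiency)

print(f"848: {speedup_848:.1f}x   246: {speedup_246:.1f}x")
```

The duals scale noticeably better here, which is consistent with less 
contention per socket.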

Joe

>
>Kind regards,
>Igor
>
>I. Kozin   at dl.ac.uk
>CCLRC Daresbury Laboratory
>tel: 01925 603308
>http://www.cse.clrc.ac.uk/disco
>
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>  
>

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 612 4615



