[Beowulf] bizarre scaling behavior on a Nehalem

Mikhail Kuzminsky kus at free.net
Wed Aug 12 11:50:16 PDT 2009

In message from Gus Correa <gus at ldeo.columbia.edu> (Wed, 12 Aug 2009 
14:09:04 -0400):
>Hi Bill, list
>Bill:  This is very interesting indeed.  Thanks for sharing!
>Bill's graph seems to show that Shanghai and Barcelona scale
>(almost) linearly with the number of cores, whereas Nehalem stops
>scaling and flattens out at 4 cores.
>The Nehalem 8 cores and 4 cores curves are virtually 
>and for very large arrays 4 cores is ahead.
>Only for huge arrays (>16M) Nehalem gets ahead
>of Shanghai and Barcelona.

IMHO, if the arrays are not "huge", they will fit in the L3 cache (8 MB!).
Or does the X axis show Mwords?
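
To make the cache argument concrete, here is a rough sketch of the arithmetic. It assumes a STREAM-like kernel that touches three arrays of 8-byte doubles and an 8 MB L3; the names L3_BYTES, working_set_bytes and fits_in_l3 are made up for this illustration and are not taken from Bill's benchmark.

```c
#include <stdint.h>

/* Assumption: 8 MB of L3, the figure quoted in this message. */
#define L3_BYTES ((uint64_t)8 * 1024 * 1024)

/* Bytes touched by a kernel that streams through `narrays` arrays
   of `n` doubles each (3 for a triad-style kernel). */
static uint64_t working_set_bytes(uint64_t n, unsigned narrays)
{
    return n * narrays * (uint64_t)sizeof(double);
}

/* Nonzero if the whole working set fits in the assumed L3 size. */
static int fits_in_l3(uint64_t n, unsigned narrays)
{
    return working_set_bytes(n, narrays) <= L3_BYTES;
}
```

With three arrays, about 349,525 doubles per array (roughly 2.7 MB each) is the largest size that still fits under these assumptions. If the axis is in Mwords, the 16M mark means 128 MB per array, far outside L3; if it is in bytes, 16 MB is only 2M doubles in total.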


>Did I interpret the graph right?
>Wasn't this the type of scaling problem that plagued
>the Clovertown and Harpertown?
>Any possibility that kernels, BIOS, etc, are not yet ready for 
>Gus Correa
>Gustavo Correa
>Lamont-Doherty Earth Observatory - Columbia University
>Palisades, NY, 10964-8000 - USA
>Bill Broadley wrote:
>> I've been working on a pthread memory benchmark that is loosely modeled on
>> McCalpin's stream.  It's been quite a challenge to remove all the
>> performance from the benchmark to get close to performance I expected.
>> Some of the obstacles:
>> * For the compilers that tend to be better at stream (open64 and 
>>   you lose the performance if you just replace double a[],b[],c[] with
>>   double *a,*b,*c. Patch[1] available.  I don't have a workaround for
>>   this; suggestions welcome.  Is it really necessary for dynamic
>>   to be substantially slower than static?
>> * You have to be very careful with pointer alignment both with cache 
>>   and each other
>> * cpu_affinity (by CPU id)
>> * numa (by socket id)
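
As an illustration of the first two obstacles, here is a minimal sketch of one way to allocate the arrays dynamically while keeping them cache-line aligned and telling the compiler they do not overlap. It is not taken from Bill's patch. The 64-byte line size, the helper name alloc_aligned_doubles and the triad kernel are assumptions for the example, and aliasing is only one commonly suggested reason for pointer-based code being slower than static arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64  /* assumption: 64-byte cache lines */

/* Allocate n doubles starting on a cache-line boundary.
   C11 aligned_alloc wants the size to be a multiple of the alignment,
   so the byte count is rounded up. Release the result with free(). */
static double *alloc_aligned_doubles(size_t n)
{
    assert(n > 0);
    size_t bytes = n * sizeof(double);
    size_t rounded = (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    return aligned_alloc(CACHE_LINE, rounded);
}

/* STREAM-style triad: a[i] = b[i] + scalar * c[i].
   `restrict` promises the compiler that the three arrays do not overlap.
   Distinct static arrays give it that guarantee for free; plain
   double* arguments do not. */
static void triad(double *restrict a, const double *restrict b,
                  const double *restrict c, double scalar, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}
```

A caller would allocate a, b and c with alloc_aligned_doubles(n), fill b and c, and call triad(a, b, c, 3.0, n). Whether this recovers the static-array performance on a given compiler would have to be measured.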
>> The results are relatively smooth graphs. Here's an example; it's
>> busy until you toggle off a few graphs (by clicking on the key):
>> http://cse.ucdavis.edu/bill/pstream.svg
>> The biggest puzzle I have now is why the previous-generation Intel quads,
>> the current-generation AMD quads, and numerous other CPUs show a big
>> benefit in L1, while the Nehalem shows no benefit.
>> [1] http://cse.ucdavis.edu/bill/stream-malloc.patch
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin 
>> To change your subscription (digest mode or unsubscribe) visit 
