[Beowulf] bizarre scaling behavior on a Nehalem

Gus Correa gus at ldeo.columbia.edu
Wed Aug 12 12:27:06 PDT 2009


Hi Bill, list

Bill:  Many thanks for all the answers.

Thanks also for the important clarification.
So, the graphs you sent before
compare dual socket Shanghai and Barcelona,
to single socket Nehalem, right?

This changes the perception a lot,
as one should at most compare the 4-thread Shanghai and Barcelona
curves (assuming the threads were running on a single socket)
to the 4-thread Nehalem curves, right?
The 8-thread curves are different animals.

Would you have the (full) comparison to dual socket Nehalem,
perhaps using the SMT feature also, and up to 16 threads?

The benefit of SMT in HPC codes you mention matches what I saw
with SMT on PPC IBM machines running climate models.
(I don't have access to Nehalems to try the same codes for now.)

Thank you,
Gus Correa

Bill Broadley wrote:
> Gus Correa wrote:
>> Hi Bill, list
>>
>> Bill:  This is very interesting indeed.  Thanks for sharing!
>>
>> Bill's graphs seem to show that Shanghai and Barcelona scale
>> (almost) linearly with the number of cores, whereas Nehalem stops
>> scaling and flattens out at 4 cores.
> 
> Right.  That's not really surprising since the Core i7 has only 4 cores.  I
> wasn't testing a dual socket Nehalem.  So on the single socket Core i7 that I
> tested, hyperthreading provided no additional performance.  Not too
> surprising, since hyperthreading is about sharing idle functional units, but
> doesn't do much when the cache or memory system is saturated.
> 
>> The Nehalem 8 cores and 4 cores curves are virtually indistinguishable,
> 
> Yes, but it was 8 threads on 4 cores, vs 4 threads on 4 cores.  I'd expect
> something less memory intensive and more CPU intensive would show a big
> difference.  In fact many of the HPC codes I've tried see a benefit.
> 
>> and for very large arrays 4 cores is ahead.
>> Only for huge arrays (>16M) Nehalem gets ahead
>> of Shanghai and Barcelona.
> 
> Yes, impressive that a single socket Intel has more main memory bandwidth than
> a dual socket Shanghai.
> 
>> Did I interpret the graph right?
>> Wasn't this the type of scaling problem that plagued
>> the Clovertown and Harpertown?
> 
> Heh, the single socket Core i7 mentioned has substantially more (2-4x) memory
> bandwidth than the previous generation Intels.
> 
>> Any possibility that kernels, BIOS, etc, are not yet ready for Nehalem?
> 
> They look good to me; I'm still trying to find out why I don't see better
> performance inside L1, though.



