2009/8/11 Rahul Nabar <rpnabar@gmail.com>:
> On Tue, Aug 11, 2009 at 12:06 PM, Bill Broadley <bill@cse.ucdavis.edu> wrote:
> > Looks to me like you fit in the Barcelona 512KB L2 cache (and get good
> > scaling) and do not fit in the Nehalem 256KB L2 cache (and get poor
> > scaling).
>
> Thanks Bill! I never realized that the L2 cache of the Nehalem is
> actually smaller than that of the Barcelona!
>
> I have an E5520 and an X5550. Both have the 8 MB L3 cache, I believe.
> The size of the L2 cache is fixed across the steppings of the Nehalem,
> isn't it?

I think it will probably only change with newer models, or with
Westmere (the 32nm shrink of Nehalem).
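
If you want to double-check what a given box actually has, the kernel
reports the cache geometry under sysfs. A minimal sketch in C (it
assumes the usual Linux layout, where index0/index1 are the L1
data/instruction caches, index2 the L2 and index3 the L3; the index
numbering isn't guaranteed on every CPU):

/* sketch: print the cache sizes Linux reports for cpu0 */
#include <stdio.h>

int main(void)
{
    char path[128], buf[32];
    int i;

    for (i = 0; i < 4; i++) {
        FILE *f;
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu0/cache/index%d/size", i);
        f = fopen(path, "r");
        if (!f)
            break;                              /* no more cache levels */
        if (fgets(buf, sizeof buf, f))
            printf("index%d size: %s", i, buf); /* buf ends in '\n' */
        fclose(f);
    }
    return 0;
}

On the Nehalems the index2 file should come back as 256K.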

> > Were the binaries compiled specifically to target both architectures? As a
> > first guess I suggest trying pathscale (RIP) or open64 for AMD, and Intel's
> > compiler for Intel. But Portland Group does a good job at both in most
> > cases.
>
> We used the Intel compilers. One of my fellow grad students did the
> actual compilation for VASP, but I believe he used the "correct" [sic]
> flags to the best of our knowledge. I could post them on the list,
> perhaps. There was no cross-compilation; we compiled a fresh binary
> for the Nehalem.

> > I'm curious about the hyperthreading-on data point as well.
>
> Didn't test for VASP yet, but for our other two DFT codes, i.e. DACAPO
> and GPAW, hyperthreading "off" seems to be about 10% faster.
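
Incidentally, if you ever need to confirm from userspace which way a
node was actually booted (rather than trusting the BIOS label), the
sysfs topology files tell you whether two logical CPUs share a core.
A rough sketch, assuming the standard thread_siblings_list file:

/* rough sketch: guess whether HT is on by checking how many logical
 * CPUs share a core with cpu0; a list like "0,8" means two threads
 * per core, a bare "0" means one */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[64];
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/topology/"
                    "thread_siblings_list", "r");

    if (!f) {
        perror("fopen");
        return 1;
    }
    if (!fgets(buf, sizeof buf, f)) {
        fclose(f);
        return 1;
    }
    fclose(f);

    printf("cpu0 shares its core with: %s", buf);
    printf("hyperthreading looks %s\n",
           (strchr(buf, ',') || strchr(buf, '-')) ? "ON" : "OFF");
    return 0;
}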

> > A doubling of the [miss rate] can have that effect. The Intel L3 cannot
> > come anywhere close to feeding 4 cores running flat out.
>
> Could you explain this more? I am a little lost with the processor
> dynamics. Does this mean using a quad core for HPC on the Nehalem is
> not likely to scale well? Or are you implying there is a solution so
> that I could fix this somehow?

Nehalem and Barcelona have the following cache architecture:

L1 cache: 64KB (32KB data, 32KB instruction), per core
L2 cache: Barcelona: 512KB, Nehalem: 256KB, per core
L3 cache: Barcelona: 2MB, Nehalem: 8MB, shared among all cores

In both Barcelona and Nehalem the "uncore" (everything outside a core,
like the L3 and the memory controllers) runs at a lower speed than the
cores, and all cores communicate through the L3, so it has to handle
coherence traffic too. This makes it impossible for the L3 to feed all
the cores at full speed if the L2 caches have high miss ratios.

So what is happening with your program is probably something like this:
the working set fits in Barcelona's 512KB L2 cache, so it has, say, a
10% miss rate, but it doesn't fit in Nehalem's 256KB L2 cache, so it
has, say, a 50% miss rate. With four cores each missing five times as
often, Nehalem's shared L3 has to handle far more requests than
Barcelona's, and it becomes a big bottleneck.
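
If you want to see that cliff directly, without dragging VASP into it,
a crude microbenchmark along these lines will do (my sketch, nothing
official): it walks working sets of increasing size, touching one
64-byte cache line at a time, and the throughput should drop sharply
once the buffer no longer fits in the per-core L2, i.e. past 512KB on
Barcelona but already past 256KB on Nehalem.

/* crude working-set microbenchmark (a sketch): walks buffers of
 * increasing size one cache line at a time; throughput should drop
 * once the buffer no longer fits in the per-core L2 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double walk(volatile unsigned char *buf, size_t size, long iters)
{
    struct timespec t0, t1;
    long i;
    size_t j;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < iters; i++)
        for (j = 0; j < size; j += 64)   /* 64 bytes = one cache line */
            buf[j]++;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    size_t size;
    long iters = 16384;                  /* halved as size doubles, so */
                                         /* total touches stay constant */
    for (size = 64 * 1024; size <= 8 * 1024 * 1024; size *= 2) {
        unsigned char *buf = calloc(size, 1);
        double secs = walk(buf, size, iters);
        printf("%5zu KB working set: %8.1f MB/s\n",
               size / 1024, size / 1024.0 / 1024.0 * iters / secs);
        free(buf);
        iters /= 2;
    }
    return 0;
}

Compile with something like "gcc -O2 -o ws ws.c -lrt" and run it pinned
to one core ("taskset -c 0 ./ws") so the scheduler doesn't smear the
numbers. The 10%/50% figures above are just illustrative, but the shape
of the curve should make the 256KB vs 512KB difference obvious.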