[Fwd: Re: [Beowulf] Cell in HPC]

Wed May 31 16:53:18 PDT 2006

On Wed, 31 May 2006, Mark Hahn wrote:

> > execution models to share instruction code, but splitting L2 data
> > across cores is bound to be a destructive use of the cache in any
> > data parallel model.  Obviously, user control of the cache is a large
> 
> "data parallel model" basically means you're streaming in/out of dram,
> right?  why are these cases not nicely covered by the placement 
> instructions implemented in mmx and followons?  you can control 
> how a load or store behaves wrt different levels of cache.  
> IIRC, Intel introduced some new stuff to make the cache shared 
> by cores more effective this way (per-core victim traffic writes through?)

By "data parallel" I mean that cores execute roughly the same
instructions but on disjoint data sets.  Since the granularity of a
partitioned data set is unlikely to be smaller than a cache line,
one core will rarely cause constructive interference by prefetching a
data cache line that is useful to the other core.  While I think I
remember Woodcrest L2 being quoted as 4MB, I wouldn't conclude that 2
cores effectively see 2MB of L2 each.  Even if L2 has relatively high
associativity, I'm not sure that high is high enough for scientific
applications that like to partition data along powers of two.
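
To make the power-of-two worry concrete, here is a rough sketch of how
strided accesses map onto cache sets.  The geometry is my assumption
(4MB, 16-way, 64-byte lines), addresses are treated as physical and
contiguous, and the names set_index and max_lines_per_set are made up
for illustration.  It models indexing only, not replacement or
prefetch.

```python
from collections import defaultdict

# Assumed L2 geometry; only the 4MB size is mentioned in the thread.
LINE_BYTES = 64
WAYS = 16
CACHE_BYTES = 4 * 1024 * 1024
NUM_SETS = CACHE_BYTES // (LINE_BYTES * WAYS)  # 4096 sets


def set_index(addr):
    """Set that a byte address maps to in this simple model."""
    return (addr // LINE_BYTES) % NUM_SETS


def max_lines_per_set(bases, stride, count):
    """Largest number of distinct lines landing in any one set when each
    core touches `count` elements `stride` bytes apart from its base."""
    lines_in_set = defaultdict(set)
    for base in bases:
        for i in range(count):
            addr = base + i * stride
            lines_in_set[set_index(addr)].add(addr // LINE_BYTES)
    return max(len(lines) for lines in lines_in_set.values())


MIB = 1024 * 1024
core_bases = [0, 32 * MIB]                # two cores, half of 64 MiB each
pow2_stride = 256 * 1024                  # power-of-two row pitch
padded_stride = pow2_stride + LINE_BYTES  # same pitch padded by one line

print(max_lines_per_set(core_bases, pow2_stride, 16))
print(max_lines_per_set(core_bases, padded_stride, 16))
```

Under these assumptions the power-of-two pitch puts all 32 lines (16
per core) into a single 16-way set, while padding the pitch by one
line spreads them out to 2 lines per set.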

Perhaps it is possible to partition the data along small or tiny
blocking factors with OpenMP.  However, ensuring lock-step or near
lock-step instruction execution across cores so as to maximize cache
line reuse on current multi-core technology seems hard (though it's
probably interesting research).
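
As a rough illustration of small blocking factors, the sketch below
models how an OpenMP schedule(static, chunk) loop hands iterations to
threads (chunks go round-robin in thread order) and counts how many
cache lines end up touched by more than one thread.  It is a model in
Python, not OpenMP itself; the 8-byte elements, 64-byte lines,
line-aligned array, and the names owner and shared_lines are all
assumptions for illustration.

```python
from collections import Counter


def owner(i, chunk, nthreads):
    """Thread that runs iteration i under schedule(static, chunk)."""
    return (i // chunk) % nthreads


def shared_lines(n, chunk, nthreads, elem_bytes=8, line_bytes=64):
    """Number of cache lines touched by more than one thread when
    iteration i accesses element i of a line-aligned array."""
    per_line = line_bytes // elem_bytes
    # Distinct (line, thread) pairs over the whole iteration space.
    pairs = {(i // per_line, owner(i, chunk, nthreads)) for i in range(n)}
    threads_per_line = Counter(line for line, _ in pairs)
    return sum(1 for k in threads_per_line.values() if k > 1)


for chunk in (4, 8, 32):
    print(chunk, shared_lines(64, chunk, 2))
```

With 64 elements and 2 threads, a chunk of 4 (half a line) leaves all
8 lines split between the two cores, while chunks of 8 or 32 give each
line a single owner.  So a blocking factor below the line size makes
cores share lines, but whether that sharing helps or hurts still
depends on the cores staying close to lock-step.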

    . . christian