[Fwd: Re: [Beowulf] Cell in HPC]

Thu Jun 1 07:04:47 PDT 2006

Christian Bell wrote:
> On Wed, 31 May 2006, Mark Hahn wrote:
>
>   
>>> execution models to share instruction code, but splitting L2 data
>>> across cores is bound to be a destructive use of the cache in any
>>> data parallel model.  Obviously, user control of the cache is a large
>>>       
>> "data parallel model" basically means you're streaming in/out of dram,
>> right?  why are these cases not nicely covered by the placement 
>> instructions implemented in mmx and followons?  you can control 
>> how a load or store behaves wrt different levels of cache.  
>> IIRC, Intel introduced some new stuff to make the cache shared 
>> by cores more effective this way (per-core victim traffic writes through?)
>>     
>
> Data parallel in that cores will execute roughly the same
> instructions but on disjoint data sets.  Since it's unlikely for the
> granularity of a partitioned data set to be less than a cache-line,
>   
     I see multiple problems for the compiler here.  The first is the one
     you imply: to share the cache effectively, the n threads navigating
     the loop must coordinate their loads regardless of the type of load
     (prefetched, simple scalar, and/or SSE/vector).  We would like the
     compiler to take into account the inter-thread implications of the
     global loop load requirements and avoid redundant and/or destructive
     loads.

     The difficulty would seem to be multiplied in the SSE/vector load case
     (which we want to use for bandwidth efficiency), because one thread
     could pull in a second thread's input data as it loads a two- (or
     four-) word-wide vector, while the second thread, running on the
     neighboring core, redundantly does the same but advanced by a stride
     of one.
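
     To make the redundancy concrete, here is a minimal C sketch (an
     illustration only, not taken from any compiler) that counts how many
     array words end up loaded by more than one thread.  The name
     shared_words, the bitmask bookkeeping, and the 32-thread limit are
     all assumptions; it models which words each thread loads and ignores
     cache-line granularity and timing.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: returns how many of the n array words are loaded
 * by more than one thread, or -1 if allocation fails.
 *   blocked == 0: iterations are dealt cyclically (thread t gets
 *                 i = t, t+threads, ...) and each iteration issues one
 *                 width-wide load starting at a[i].
 *   blocked != 0: thread t owns whole width-wide chunks t, t+threads, ...
 *                 and loads only those. */
static long shared_words(long n, int threads, int width, int blocked)
{
    assert(n > 0 && width > 0);
    assert(threads > 0 && threads <= 32);   /* one bit per thread below */

    unsigned *mask = calloc((size_t)n, sizeof *mask);
    if (mask == NULL)
        return -1;

    for (int t = 0; t < threads; t++) {
        long step  = blocked ? (long)threads * width : threads;
        long start = blocked ? (long)t * width : t;
        for (long i = start; i + width <= n; i += step)
            for (int k = 0; k < width; k++)
                mask[i + k] |= 1u << t;     /* thread t loaded word i+k */
    }

    long shared = 0;
    for (long i = 0; i < n; i++)
        if (mask[i] & (mask[i] - 1))        /* more than one bit set */
            shared++;

    free(mask);
    return shared;
}
```

     With n = 16, two threads, and four-word vectors, the cyclic
     (stride-1 interleaved) assignment has 14 of the 16 words loaded by
     both threads, while the chunked assignment has none.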

     Microarchitectures with both thread and vector capabilities must
     consider the loop workload (in particular its loads) as a
     two-dimensional problem sized as 'thread-number-by-vector-length' (an
     approach taken in the VTA and X1E architectures).  I would be
     interested in hearing from compiler folks on how this problem is, or
     would be, handled.  Thread-specific loop unrolling would seem to be
     useful (giving one thread compute responsibility for the vector of
     data it loads).  Then there is the issue of dependencies both within
     and across threads.
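
     As a rough sketch of what thread-specific unrolling might look like,
     the C fragment below gives each thread whole vector-length chunks of
     a simple y += a*x loop, so a thread computes on exactly the words it
     loads.  The names saxpy_thread and VLEN, the chunk-cyclic ownership
     rule, and the tail handling are assumptions made for illustration;
     the inner loop is plain C that a compiler may or may not turn into
     vector instructions.  This loop has no cross-iteration dependencies,
     so it sidesteps the dependency issue entirely.

```c
#include <assert.h>
#include <stddef.h>

enum { VLEN = 4 };   /* assumed vector length in words */

/* Thread `tid` of `nthreads` updates y += a*x over the VLEN-wide chunks
 * it owns: chunk c belongs to thread c % nthreads. */
static void saxpy_thread(int tid, int nthreads, float a,
                         const float *x, float *y, size_t n)
{
    assert(nthreads > 0 && tid >= 0 && tid < nthreads);

    size_t stride = (size_t)nthreads * VLEN;

    /* Outer dimension: chunks owned by this thread.
     * Inner dimension: the VLEN words of one vector. */
    for (size_t i = (size_t)tid * VLEN; i + VLEN <= n; i += stride)
        for (int k = 0; k < VLEN; k++)
            y[i + k] += a * x[i + k];

    /* A tail shorter than VLEN goes to the thread that would have owned
     * that chunk, so every element is updated exactly once. */
    size_t tail = n - n % VLEN;
    if ((tail / VLEN) % (size_t)nthreads == (size_t)tid)
        for (size_t j = tail; j < n; j++)
            y[j] += a * x[j];
}
```

     Calling saxpy_thread once per thread id (serially, or from real
     threads, since the chunks are disjoint) touches every element
     exactly once, and no thread loads a word that another thread
     computes on.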

     This says nothing about managing such vector/thread loads across the
     partitioned global address space abstraction exposed by the UPC and
     CAF parallel programming extensions.

      rbw

-- 

Richard B. Walsh

Project Manager
Network Computing Services, Inc.
Army High Performance Computing Research Center (AHPCRC)
rbw at ahpcrc.org  |  612.337.3467
