[Fwd: Re: [Beowulf] Cell in HPC]
Richard Walsh
rbw at ahpcrc.org
Thu Jun 1 07:04:47 PDT 2006
Christian Bell wrote:
> On Wed, 31 May 2006, Mark Hahn wrote:
>
>
>>> execution models to share instruction code, but splitting L2 data
>>> across cores is bound to be a destructive use of the cache in any
>>> data parallel model. Obviously, user control of the cache is a large
>>>
>> "data parallel model" basically means you're streaming in/out of dram,
>> right? why are these cases not nicely covered by the placement
>> instructions implemented in mmx and follow-ons? you can control
>> how a load or store behaves wrt different levels of the cache.
>> IIRC, Intel introduced some new stuff to make the cache shared
>> by cores more effective this way (per-core victim traffic writes through?)
>>
>
> Data parallel in that cores will execute roughly the same
> instructions but on disjoint data sets. Since it's unlikely for the
> granularity of a partitioned data set to be less than a cache-line,
>
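The "placement instructions" Mark refers to are the prefetch and
non-temporal hints. A minimal sketch of steering a streaming copy
around the cache with them (hypothetical function name, assuming x86
with SSE; a sketch, not production code):

```c
#include <immintrin.h>

/* Copy n floats (n a multiple of 4, dst 16-byte aligned) while
   keeping the traffic out of the cache: prefetch the source with a
   non-temporal hint, write the destination with streaming stores. */
void stream_copy(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; i += 4) {
        /* hint: pull src ahead of us without polluting the cache */
        _mm_prefetch((const char *)(src + i + 16), _MM_HINT_NTA);
        __m128 v = _mm_loadu_ps(src + i);
        _mm_stream_ps(dst + i, v);  /* non-temporal (cache-bypassing) store */
    }
    _mm_sfence();  /* order/flush the streaming stores */
}
```

That is the per-core mechanism; it says nothing yet about
coordinating several cores that share one L2, which is the problem
below.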
I see multiple problems for the compiler here. The first is the one
you imply: in order to share the cache effectively, the n threads
navigating the loop must coordinate their loads regardless of the
type of load (prefetched, simple scalar, and/or SSE/vector). We would
like the compiler to take into account the inter-thread implications
of the loop's global load requirements and avoid redundant and/or
destructive loads.

The difficulty would seem to be multiplied in the SSE/vector load
case (which we want to exploit for bandwidth efficiency reasons),
because one thread could pull in a second thread's input data as it
loads a two- or four-word-wide vector, while the second thread,
running in the neighboring core, redundantly does the same, but
advanced by one stride.
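To make the overlap concrete: with 64-byte lines, 4-byte elements,
and 4-wide vector loads, a cyclic (round-robin) assignment of vector
chunks to two threads puts both threads on the same cache line, while
a contiguous blocked split keeps their line sets disjoint. A
hypothetical back-of-the-envelope sketch (all names made up):

```c
#define LINE_ELEMS 16  /* 64-byte cache line / 4-byte element */
#define VLEN 4         /* 4-wide SSE-style vector load */

/* cache line touched by a vector load starting at element i */
static int line_of(int i) { return i / LINE_ELEMS; }

/* first element loaded by thread t under a cyclic distribution of
   vector-width chunks between two threads */
static int cyclic_start(int t) { return t * VLEN; }

/* first element loaded by thread t when each of two threads gets a
   contiguous half of an n-element loop */
static int blocked_start(int t, int n) { return t * (n / 2); }
```

Under the cyclic split both threads' first vector loads land on line
0, so the same line is fetched (or fought over) twice; under the
blocked split the two threads start four lines apart.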
Microarchitectures with both thread and vector capabilities must
consider the loop's work load (in particular its loads) as a
two-dimensional problem sized 'thread-number-by-vector-length' (an
approach taken in the VTA and X1E architectures). I would be
interested in hearing from compiler folks on how this problem is or
would be handled. Thread-specific loop unrolling would seem to be
useful, giving one thread compute responsibility for the vector of
data it loads.
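One way to picture that thread-specific unrolling: partition the loop
so each thread owns whole, cache-line-aligned chunks and is therefore
the only consumer of the vectors it pulls in. A hypothetical helper
(assuming 64-byte lines and 4-byte elements; a sketch only):

```c
#define LINE_ELEMS 16  /* 64-byte cache line / 4-byte element */

/* Give thread t of nthreads a contiguous, cache-line-aligned
   [*lo, *hi) slice of an n-element loop, so that no two threads
   ever load the same line. */
static void slice(int n, int t, int nthreads, int *lo, int *hi)
{
    int lines = (n + LINE_ELEMS - 1) / LINE_ELEMS;  /* total lines  */
    int per   = (lines + nthreads - 1) / nthreads;  /* lines/thread */
    *lo = t * per * LINE_ELEMS;
    *hi = (t + 1) * per * LINE_ELEMS;
    if (*lo > n) *lo = n;
    if (*hi > n) *hi = n;
}
```

A real compiler would additionally have to align the vector loads
themselves to these boundaries and peel any remainder iterations.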
Then there is the issue of dependencies both within and across
threads. And this says nothing about managing such vector/thread
loads across the partitioned global address space abstraction pointed
at by the UPC and CAF parallel programming extensions.
rbw
--
Richard B. Walsh
Project Manager
Network Computing Services, Inc.
Army High Performance Computing Research Center (AHPCRC)
rbw at ahpcrc.org | 612.337.3467