[Fwd: Re: [Beowulf] Cell in HPC]

Tue May 30 07:12:28 PDT 2006

All,

This is an excellent review of the CELL against leading VLIW/EPIC
(Itanium), superscalar (Opteron), and vector (Cray X1E) processors.
From my reading, the messages are:

    1.  Vector operations deliver a higher percentage of peak at lower
        power than the alternatives on the HPC kernels.  A comparison
        to an MTA-like, highly multi-threaded architecture is missing.
        Most 32-bit numbers exceed the non-vector alternatives by
        substantially more than 2x in measured performance on the
        dense matrix, sparse matrix, stencil, and FFT kernels (so, in
        their view, dual core will not create parity on
        sustained/measured [not peak] comparisons).

    2.  A three-tiered memory system with a simple local memory (store,
        like the old Cray-2) that is user/software managed is
        preferable to cache in the above context.  Double buffering
        and prefetching to the local store reduce memory delays
        dramatically.

    3.  CELL's vector instructions from local memory need augmenting to
        include more "unaligned load" support ... indexed and non-unit
        stride capability (it seems that loads from memory to the
        local store do have these features).

    4.  Double precision (64-bit) operations are severely hampered by
        instruction issue delays, so performance at 64 bits drops off
        dramatically.  The reviewers suggest a few minor modifications
        to the design to reduce this problem.
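
To make point 2 concrete, here is a minimal sketch of double buffering
in plain C.  It is not CELL SDK code: fetch_chunk() is a hypothetical
stand-in (a plain memcpy) for an asynchronous DMA transfer into the
local store, and CHUNK is an arbitrary buffer size chosen for
illustration.  On real hardware the fetch would return immediately and
the compute loop would overlap with the transfer of the next chunk.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 256  /* elements per local buffer (illustrative size) */

/* Hypothetical stand-in for an asynchronous DMA "get" from main memory
   into local store.  Here it is a synchronous memcpy. */
static void fetch_chunk(float *local, const float *main_mem, size_t n)
{
    memcpy(local, main_mem, n * sizeof *local);
}

/* Sum n floats, working on one buffer while the other is being filled. */
double sum_double_buffered(const float *data, size_t n)
{
    float buf[2][CHUNK];                 /* the two "local store" buffers */
    double total = 0.0;
    size_t nchunks = (n + CHUNK - 1) / CHUNK;
    int cur = 0;

    if (n == 0)
        return 0.0;

    /* Prefetch the first chunk before entering the loop. */
    fetch_chunk(buf[cur], data, n < CHUNK ? n : CHUNK);

    for (size_t c = 0; c < nchunks; c++) {
        size_t len = (c + 1 < nchunks) ? CHUNK : n - c * CHUNK;

        /* Start filling the other buffer with the next chunk. */
        if (c + 1 < nchunks) {
            size_t off = (c + 1) * CHUNK;
            size_t next_len = (n - off < CHUNK) ? n - off : CHUNK;
            fetch_chunk(buf[cur ^ 1], data + off, next_len);
        }

        /* A real implementation would wait here for the transfer into
           buf[cur] to complete before touching it. */
        for (size_t i = 0; i < len; i++)
            total += buf[cur][i];

        cur ^= 1;                        /* swap buffers */
    }
    return total;
}
```

With a truly asynchronous fetch, the memory delay for chunk c+1 is
hidden behind the arithmetic on chunk c, which is the effect described
above.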

They also argue that the CELL chip will be produced in large enough
quantity to compete on price with the multi-core superscalars.  I am
not so sure of this.  Also, the issue of vector-type memory operations
across a "commodity interconnect" in the context of the Beowulf
distributed memory architecture is not addressed.  Vector memory
references are especially revealing of the limitations of the RDMA
capabilities of current interconnects.

CELL is a data-parallel heavyweight pitted against the
instruction-parallel multi-core alternatives, and the question being
considered is how latency should be hidden: underneath stacks of
independent/atomic instruction blocks (threads), which may or may not
come from the same program, or within a pipeline of vector operations
that stream data from memory.  Apps with partitionable data and some
kind of non-random reference pattern (most HPC apps) favor data
parallelism and vectors, while workloads with more completely random
references and the large thread counts (graph algorithms) typical of
the mixed user environment of servers favor thread-level instruction
parallelism.
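
As a rough illustration of the two reference patterns (my own sketch,
not taken from the review; the function names are hypothetical), compare
a unit-stride stream with a dependent pointer chase in C:

```c
#include <stddef.h>

/* Unit-stride stream: the address of element i+1 is known before
   element i arrives, so loads can be pipelined, vectorized, or
   prefetched in bulk. */
double sum_stream(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Dependent chain: each load's address comes from the previous load,
   so a single thread stalls for the full memory latency on every step.
   Hiding that latency takes many independent chains (threads). */
size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t p = start;
    for (size_t k = 0; k < steps; k++)
        p = next[p];
    return p;
}
```

The first loop is the kind of access a vector pipeline or a
software-managed local store handles well; the second is closer to a
graph traversal, where switching among many threads is the usual way to
keep the memory system busy.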

There is one microprocessor architecture that I have seen from MIT,
VTA (Vector Thread Architecture), which seems to combine both in a
workable fashion.  I recommend the articles describing the VTA
microprocessor out of Krste Asanovic's group at MIT.  I think they have
the ISA finished and are taping out the chip as I type.

Regards,

rbw

-- 

Richard B. Walsh

Project Manager
Network Computing Services, Inc.
Army High Performance Computing Research Center (AHPCRC)
rbw at ahpcrc.org  |  612.337.3467


-------------- next part --------------
An embedded message was scrubbed...
From: Richard Walsh <rbw at ahpcrc.org>
Subject: Re: [Beowulf] Cell in HPC
Date: Tue, 30 May 2006 09:06:34 -0500
Size: 10172
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20060530/c037f050/attachment.mht>