[Fwd: Re: [Beowulf] Cell in HPC]
Richard Walsh
rbw at ahpcrc.org
Tue May 30 07:12:28 PDT 2006
All,
This is an excellent review of the CELL against leading VLIW/EPIC
(Itanium), superscalar (Opteron), and vector (Cray X1E) processors. From my
reading, the main messages are:
1. Vector operations deliver a higher percentage of peak at lower power
   than the alternatives on the HPC kernels. A comparison to an MTA-like,
   highly multi-threaded architecture is missing from the comparison.
   Most 32-bit numbers exceed the non-vector alternatives by substantially
   more than 2x in measured performance on the dense matrix, sparse matrix,
   stencil, and FFT kernels (so dual core will not create parity on
   sustained/measured [not peak] comparisons, in their view).
2. A three-tiered memory system with a simple, user/software-managed local
   memory (a local store, like the old Cray-2) is preferable to cache in
   this context. Double buffering and prefetching into the local store
   reduce memory delays dramatically (see the sketch after this list).
3. CELL's vector instructions from local memory need augmenting to include
   more "unaligned load" support ... indexed and non-unit-stride capability
   (loads from main memory into the local store appear to have these
   features).
4. Double-precision (64-bit) operations are severely hampered by
   instruction issue delays, so performance at 64 bits drops off
   dramatically. The reviewers suggest a few minor modifications to the
   design that would reduce this problem.
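To make point 2 concrete, here is a minimal C sketch of the double-buffering
pattern on a software-managed local store. The dma_get()/dma_wait() helpers
are hypothetical stand-ins for asynchronous DMA primitives (on CELL, the MFC
DMA commands); they are implemented synchronously with memcpy here only so
the sketch compiles, and the chunk size and kernel are placeholders.

/*
 * Double buffering into a software-managed local store: while the core
 * computes on buffer[cur], the next chunk is (in a real, asynchronous
 * implementation) streaming into buffer[1 - cur].
 */
#include <stddef.h>
#include <string.h>

#define CHUNK 1024                     /* elements per local-store buffer */

static void dma_get(float *local, const float *remote, size_t n, int tag)
{
    (void)tag;                         /* a real DMA engine queues by tag */
    memcpy(local, remote, n * sizeof *local);
}

static void dma_wait(int tag)          /* a real version polls or blocks  */
{
    (void)tag;
}

static void compute(float *chunk, size_t n)
{
    for (size_t i = 0; i < n; i++)     /* stand-in for the HPC kernel     */
        chunk[i] *= 2.0f;
}

/* Assumes total is a multiple of CHUNK, for brevity. */
void process_stream(const float *remote, size_t total)
{
    static float buffer[2][CHUNK];
    int cur = 0;

    dma_get(buffer[cur], remote, CHUNK, cur);          /* prime buffer 0  */

    for (size_t i = 0; i < total; i += CHUNK) {
        int next = 1 - cur;
        if (i + CHUNK < total)                         /* prefetch ahead  */
            dma_get(buffer[next], remote + i + CHUNK, CHUNK, next);

        dma_wait(cur);                                 /* data in place   */
        compute(buffer[cur], CHUNK);                   /* overlaps DMA    */
        cur = next;
    }
}

With truly asynchronous transfers, the compute on the current chunk hides
most of the fetch of the next one, which is the "reduce memory delays"
effect described in point 2.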
They also argue that the CELL chip will be produced in large enough
quantity to compete on price with the multi-core superscalars. I am not so
sure of this.
Also, the issue of vector-type memory operations across a "commodity
interconnect", in the context of the Beowulf distributed-memory
architecture, is not addressed. Vector memory references are especially
revealing of the limitations of the RDMA capabilities of current
interconnects.
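This can be made quantitative with a back-of-envelope model (illustrative
numbers only, not measurements): assume a per-message RDMA latency and link
bandwidth, and compare fetching a vector's worth of operands element by
element against fetching the same bytes as one contiguous block.

/*
 * Back-of-envelope cost model for a gather-style vector reference over an
 * RDMA interconnect.  The latency and bandwidth figures are assumptions
 * chosen only to show the shape of the problem.
 */
#include <stdio.h>

int main(void)
{
    const double latency_s  = 5e-6;   /* assumed per-message RDMA latency */
    const double bandwidth  = 1e9;    /* assumed link bandwidth, bytes/s  */
    const int    n_elems    = 4096;   /* operands in one vector reference */
    const double elem_bytes = 8.0;    /* double-precision operands        */

    /* One small RDMA read per gathered element: latency dominates. */
    double gather_s = n_elems * (latency_s + elem_bytes / bandwidth);

    /* One contiguous block of the same total size: bandwidth dominates. */
    double block_s = latency_s + n_elems * elem_bytes / bandwidth;

    printf("element-wise gather: %10.1f us\n", gather_s * 1e6);
    printf("contiguous block:    %10.1f us\n", block_s * 1e6);
    return 0;
}

Under these assumed numbers the element-wise gather comes out roughly 500
times slower; the particular figures matter less than the ratio, which is
why scattered vector references expose RDMA limitations so sharply.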
CELL is a data-parallel heavyweight pitted against instruction-parallel
multi-core alternatives, and the question being considered is how latency
should be hidden: underneath stacks of independent/atomic instruction
blocks (threads), which may or may not come from the same program, or
within a pipeline of vector operations that stream data from memory.
Applications with partitionable data and some kind of non-random reference
pattern (most HPC applications) favor data-parallelism and vectors, while
workloads with more completely random references and large thread counts
(graph algorithms), typical of the mixed user environment of servers, favor
thread-level instruction parallelism.
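A small, purely illustrative pair of loops shows the two ends of that
spectrum (hypothetical arrays, not code from the paper):

/*
 * Two access patterns at the ends of the spectrum described above.
 */
#include <stddef.h>

/* Unit-stride, partitionable sweep: streams well on a data-parallel/
 * vector machine. */
void regular_sweep(float *out, const float *in, size_t n)
{
    for (size_t i = 1; i + 1 < n; i++)
        out[i] = 0.5f * in[i] + 0.25f * (in[i - 1] + in[i + 1]);
}

/* Index-driven gather, as in sparse or graph codes: effectively random
 * references whose latency is more naturally hidden by many independent
 * threads than by a vector pipeline, unless the hardware has strong
 * indexed (gather) load support. */
void irregular_gather(float *out, const float *in,
                      const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[idx[i]];
}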
There is one microprocessor architecture I have seen, the VTA
(Vector-Thread Architecture) from MIT, which seems to combine both in a
workable fashion. I recommend the articles describing the VTA
microprocessor out of Krste Asanovic's group at MIT. I think they have the
ISA finished and are taping out the chip as I type.
Regards,
rbw
--
Richard B. Walsh
Project Manager
Network Computing Services, Inc.
Army High Performance Computing Research Center (AHPCRC)
rbw at ahpcrc.org | 612.337.3467