[Beowulf] Re: vectors vs. loops

Vincent Diepeveen diep at xs4all.nl
Wed May 4 01:15:30 PDT 2005


It is far cheaper to shift calculations onto 1 chip with gigabytes per second
of memory bandwidth than to rely, as today's chips do, upon network cards
(or hubs) that move data at most at 1 gigabyte a second towards switches
and routers.
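To put rough, illustrative numbers on that (my assumptions, not
measurements): shipping an 8 MB block of matrix data over a 1 gigabyte a
second link costs about 8 milliseconds before you even count message
latency, while pulling the same block out of local memory at the roughly
25 gigabytes a second the Cell's XDR memory is specified for costs about
0.3 milliseconds. Every trip across the network is more than an order of
magnitude slower.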

Matrix calculations across thousands of processors are no problem even with
such limited bandwidth between the nodes: dense matrix work does O(n^3)
flops on O(n^2) data, so the computation grows faster than the communication.

If we consider an 8-node system whose work now shifts onto 1 node, it's
simply easier to do the same calculation within 1 Cell processor than it is
to do the same calculation spread across 8 nodes.

Of course it's true that such drastically faster chips mean only networks
of really fine quality will survive, rather than the cheapest ones, but
the most important thing is that the entire parallelization of the
software gets easier, not harder, assuming you can keep the calculations
within 1 chip.

What gets harder, relatively speaking, are calculations that aren't
satisfied with that 256 gflop, but which need many nodes to do the work.

So to speak, an extra layer needs to be built into the software to keep
using networks that deliver 'only' a few hundred megabytes a second from
one node to another.
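
That layer boils down to: do as much as possible on-chip, and touch the
network as little as possible. A minimal sketch (the MPI-plus-OpenMP split
and the names are my assumption of how one would write it, not a finished
design):

    #include <mpi.h>
    #include <omp.h>

    /* Two-level parallel sum: MPI carries one small message per node
       across the slow network, while OpenMP spreads each node's share
       over the cores of one chip, where bandwidth is cheap. */
    double two_level_sum(const double *local, long n, MPI_Comm comm)
    {
        double partial = 0.0, total = 0.0;

        /* Level 2: all cores on this chip, no network traffic. */
    #pragma omp parallel for reduction(+:partial)
        for (long i = 0; i < n; i++)
            partial += local[i];

        /* Level 1: a single 8-byte message per node over the wire. */
        MPI_Allreduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, comm);
        return total;
    }

The point of the split is that only 8 bytes per node ever cross the
network; everything else stays inside the chip.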

In any case, a single node with Cell processors delivers more gflops than
the 256-node, 1024-processor Origin 3800 my government bought in the year 2000.

That is really nice for scientists, considering such a single Cell node
should come at a reasonable price.

Governments can now buy a few Cells real cheap to get a bunch of gflops,
and for the integer programs they run they have far more choice than in
the past when it comes to buying a huge supercomputer cheaply.

Vincent

At 09:18 PM 5/3/2005 -0400, Michael T. Prinkey wrote:
>On Wed, 4 May 2005, Vincent Diepeveen wrote:
>
>> This isn't at all the problem. So to speak for highend processors even a
>> medium engineer can add a few on die memory controllers to solve the
>> bandwidth problem. 1 for every few cores.
>
>I'm not a hardware engineer, but it seems to me that pin count becomes an
>issue.  With N cores and M memory controllers, there would need to be
>M*bus_width pins on the socket and M independent banks of RAM modules.  
>That doesn't seem viable as producing a chip package with such a high pin
>count and motherboards with so many traces will not be very economical.  
>The motivation for multicore chips is precisely to avoid that packaging
>and motherboard complexity.  But, to do it another way (say, with memory
>controllers sharing a bus to memory) would not provide scalable memory
>performance due to bus contention.
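
To put rough numbers on Mike's pin count point (assumed figures, just for
scale): one DDR channel needs on the order of a hundred signal pins, 64
for data plus ECC, address, and control, so M = 4 independent controllers
already adds some 400-500 pins to the package before power and ground.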
>
>The Cell processor seems to do away with cache for its vector units and
>instead allocates a flat scratch space of memory in its place.  Data has
>to be DMA'd to and from main memory explicitly.  This avoids all of the
>cache coherency logic.  This does provide scalable memory performance...at
>least to the limits of that memory space.
>
>It is interesting to note that these CPU/memory units in the Cell look
>more and more like a "computer-on-a-chip" and the Cell vector unit itself
>is basically a "cluster-on-a-chip."  Communication to main memory and to
>other vector units needs to be explicitly scheduled (à la MPI).  Each core
>is optimally balanced in memory bandwidth and computational performance.  
>The data that it works on is local *by definition* and hence fast.  And
>the messy logic associated with cache maintenance gets flushed away, but
>at the expense of requiring explicit fetching/message passing.
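
In rough C, the model Mike describes looks something like this. A sketch
only: on the real Cell the two memcpy's would be MFC DMA transfers
(mfc_get/mfc_put), double-buffered to hide the latency.

    #include <string.h>

    #define TILE 2048                /* doubles that fit in the scratch */
    static double scratch[TILE];     /* stands in for the local store   */

    /* Pull a tile in explicitly, compute on the fast local copy,
       push the result back.  No cache, so no coherency traffic. */
    void process_tile(double *main_mem, long offset, long n)
    {
        memcpy(scratch, main_mem + offset, n * sizeof scratch[0]); /* "get" */
        for (long i = 0; i < n; i++)
            scratch[i] *= 2.0;       /* whatever the real kernel is */
        memcpy(main_mem + offset, scratch, n * sizeof scratch[0]); /* "put" */
    }

Every byte the core touches was moved there on purpose; nothing is implicit.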
>
>This may well be the future.  Small cores with ever growing "scratch"  
>space that uses explicit fetch/stores to main memory.  Maybe some/most of
>this will change from SRAM to DRAM to make these local memories larger.
>Whatever the case, main storage needs to find its way closer to the
>processing core.  Double/Quad pumped buses and 128-bit memory channels can
>only take us so far.
>
>Cell is a test architecture.  It will be interesting to see if the
>compilers can uncover the parallelism and make good use of this approach
>or if the technology will fade into the background like VLIW/EPIC mostly
>has.
>
>Mike Prinkey
>


