[Beowulf] Teraflop chip hints at the future

Fri Feb 16 14:17:20 PST 2007

Jim Lux wrote:
> At 07:03 AM 2/13/2007, Richard Walsh wrote:
>> Yes, but how much does it really abandon von Neumann.  It is just a lot
>> of little von Neumann machines unless the mesh is fully programmable
>> and the DRAM stacks can source data for any operation on any cpu as
>> the application's data flows through the application kernel(s) however
>> it is laid out across the chip.  And in that case it is a multi-core
>> ASIC emulating an FPGA ... why not just use an FPGA ... ;-) ... and
>> avoid wasting all those hard-wired functional units that won't be
>> needed for this or that particular kernel.
> In fact, modern high density FPGAs (viz Xilinx Virtex II 6000 series) 
> have partitioned their innards into little cells, some with ALU and 
> combinatorial logic and a little memory, some with lots of memory and 
> not so much logic.
    Hey Jim,

    Yes, I do understand this, although attention for double-precision
    ops on FPGAs is focused on the Xilinx Virtex-5 at 65 nm.  You can
    already get a PCIe card version, I think.  My comments about the new
    80-core/ASIC Intel chip were meant to suggest two things.  The first
    was that having the ability to program your own core (a la VHDL,
    Verilog, Mitrion-C, Handel-C, etc.) that is specific to your kernel
    is more circuit-efficient in theory, so if you are going to have
    multiple cores, consider making them programmable.  It's like the
    plumber who brings all the tools he needs into the house for the job
    at hand, and only those tools.

    The second point I was trying to make was that all cyclic
    re-referencing of the same store (local or remote) is a reflection
    of the von Neumann model (even to the stacked DRAM in the new Intel
    chip).  When the processor cannot "swallow the kernel whole" it has
    to consume it in von Neumann-like bites, which imply register,
    cache, and memory writes.  Part of the programmable-core process is
    making the connections between upstream and downstream hardware in
    a data-flow fashion that replaces some number of cyclic stores with
    in-line passes to the next collection of functional units required
    by the application-specific kernel.
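
    A minimal sketch of the contrast described above, in Python rather
    than a hardware description language.  The three stage functions,
    the names store_and_reload and dataflow, and the store counter are
    all hypothetical illustrations, not anything from the Intel, TRIPS,
    or Raw designs.  The first routine writes a full intermediate array
    after every stage (the cyclic re-reference); the second hands each
    value directly to the next stage and stores only the final result.

```python
from typing import Callable, List, Sequence, Tuple

Stage = Callable[[float], float]


def scale(x: float) -> float:
    return 2.0 * x


def offset(x: float) -> float:
    return x + 1.0


def square(x: float) -> float:
    return x * x


# Hypothetical three-stage kernel: square(offset(scale(x))).
STAGES: List[Stage] = [scale, offset, square]


def store_and_reload(data: Sequence[float],
                     stages: Sequence[Stage]) -> Tuple[List[float], int]:
    """Each stage writes a whole intermediate array that the next re-reads."""
    stores = 0
    current = list(data)
    for stage in stages:
        result = []
        for x in current:
            result.append(stage(x))  # intermediate value goes back to the store
            stores += 1
        current = result
    return current, stores


def dataflow(data: Sequence[float],
             stages: Sequence[Stage]) -> Tuple[List[float], int]:
    """Each value passes in-line through every stage and is stored once."""
    stores = 0
    out = []
    for x in data:
        for stage in stages:
            x = stage(x)  # handed straight to the next functional unit
        out.append(x)
        stores += 1
    return out, stores


if __name__ == "__main__":
    values = [0.0, 1.0, 2.0, 3.0]
    print(store_and_reload(values, STAGES))
    print(dataflow(values, STAGES))
```

    Both routines compute the same result; only the number of writes
    back to the store differs (one per element per stage versus one per
    element).  In software the interpreter still uses registers and
    memory underneath, so this only models the counting argument, not
    real hardware latency.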

    In this way, the "diameter" of the re-reference cycle is enlarged
    and the latency penalty is therefore reduced.  So while the ASIC
    cores in the new Intel chip are not programmable in the FPGA sense,
    there is the hope/expectation that the interconnect on the chip will
    give the data-flow benefits described.  These are the features of
    the multi-core TRIPS and Raw processors that allow them to emulate
    ILP-, TLP-, and DLP-oriented architectures and applications.  The
    extent to which FPGAs are more flexible in this regard gives them an
    advantage over less "wire-exposed" multi-core ASIC architectures.

    There are obvious drawbacks to FPGAs ... they are not commodity
    enough, programmability is poor and foreign, and the improvements
    (Mitrion-C) generally consume 2x the circuits and run at 1/2 the
    clock that the FPGA in use is capable of.  Joe Landman pointed out
    the large chunk of the device that the interface architecture can
    consume, and for HPC-size data sets you still need to stream data
    in and out to external memory (algorithms must be pipelined).
    Still, it seems like over the long haul some of the FPGA advantages
    mentioned will creep into the HPC space -- either on the chip or via
    accelerators.  Underwood at Sandia has a nice paper showing that
    peak flop performance on FPGAs exceeded that of commodity CPUs in
    the summer of 2004 (the same time Intel dropped the race to the
    4.0 GHz clock) ... although the data needs to be updated with the
    Virtex-5 and the new multi-core processors.
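
    As a rough worked example of the "2x the circuits at 1/2 the clock"
    figure above: if one assumes the design is area-limited, so that
    the number of pipelines that fit on the device scales inversely
    with the circuit area each one needs, the two penalties multiply.
    The function name and the area-limited assumption are mine, not
    something taken from Mitrion-C documentation or measurements.

```python
def relative_peak(area_factor: float, clock_factor: float) -> float:
    """Peak throughput relative to a hand-written design on the same FPGA.

    area_factor:  circuit area per pipeline relative to the hand design
                  (2.0 means twice the circuits).
    clock_factor: achieved clock relative to the hand design
                  (0.5 means half the clock).

    Assumption: the device is area-limited, so the number of replicated
    pipelines scales as 1 / area_factor.
    """
    if area_factor <= 0 or clock_factor <= 0:
        raise ValueError("factors must be positive")
    return clock_factor / area_factor


if __name__ == "__main__":
    # 2x the circuits, 1/2 the clock
    print(relative_peak(2.0, 0.5))
```

    Under that assumption the high-level tool reaches about a quarter
    of the hand-tuned peak.  If the kernel is limited by I/O or memory
    bandwidth rather than area, the penalty would look different.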

    Here are some papers that I think you can Google that I have found
    useful/interesting.

       1.  Evaluation of the Raw Microprocessor:  An Exposed-Wire-Delay
           Architecture for ILP and Streams.  Taylor, et al.

       2.  Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS
           Architecture.

       3.  FPGAs vs CPUs:  Trends in Peak Floating-Point Performance.
           Keith Underwood.

       4.  Architectures and APIs:  Assessing Requirements for
           Delivering FPGA Performance to Applications.
           Underwood and Hemmert.

       5.  64-bit Floating-Point FPGA Matrix Multiplication.
           Yong Dou et al.

       6.  Scalable and Modular Algorithms for Floating-Point Matrix
           Multiplication on FPGAs.  Ling Zhuo and Viktor Prasanna.

       7.  Computing Lennard-Jones Potentials and Forces with
           Reconfigurable Hardware.
> I think that as a general rule, the special purpose cores (ASICs) are 
> going to be smaller, lower power, and faster (for a given technology) 
> than the programmable cores (FPGAs).  Back in the late 90s, I was 
> doing tradeoffs between general
    Here you are arguing for an ASIC for each typical HPC kernel ... a la
    the GRAPE processor.  I will buy that ... but a commodity multi-core
    CPU is not HPC-special-purpose or low power compared to an FPGA.
> purpose CPUs (PowerPCs), DSPs (ADSP21020), and FPGAs for some signal 
> processing applications.  At that time, the DSP could do the FFTs, 
> etc, for the least joules and least time.  Since then, however, the 
> FPGAs have pulled ahead, at least for spaceflight applications.   But 
> that's not because of architectural superiority in a given process.. 
> it's that the FPGAs are benefiting from improvements in process 
> (higher density) and nobody is designing space qualified DSPs using 
> those processes (so they are stuck with the old processes).
    Better process is good, but I think I hear you arguing for
    HPC-specific ASICs again, like the GRAPE ... if they can be made
    cheaply, then you are right ... take the bit stream from the FPGA
    CFD code I have written and tuned, and produce 1000 ASICs for my
    special-purpose CFD-only cluster.  I can run it at higher clock
    rates, but I may need a new chip every time I change my code.
> Heck, the latest SPARC V8 core from ESA (LEON 3) is often implemented 
> in an FPGA, although there are a couple of space qualified ASIC 
> implementations (from Atmel and Aeroflex).
>
> In a high volume consumer application, where cost is everything, the 
> ASIC is always going to win over the FPGA.  For more specialized 
> scientific computing, the trade is a bit more even ... But even so, 
> the beowulf concept of combining large numbers of commodity computers 
> leverages the consumer volume for the specialized application, giving 
> up some theoretical performance in exchange for dollars.
     Right, otherwise we would all be using our own version of GRAPE,
     but we are all looking for the "New, New Thing" ... a new
     price-performance regime to take us up to the next level.  Is it
     going to be FPGAs, GPGPUs, commodity multi-core, PIM, or novel
     80-processor Intel chips?  I think we are in for a period of
     extended HPC market fragmentation, but in any case I think two
     features of FPGA processing, the programmable core and the
     data-flow programming model, have intrinsic/theoretical appeal.
     These forces may be completely overwhelmed by other forces in the
     marketplace, of course ...

     Regards,

     rbw

-- 

Richard B. Walsh

"The world is given to me only once, not one existing and one
 perceived. The subject and object are but one."

Erwin Schroedinger

Project Manager
Network Computing Services, Inc.
Army High Performance Computing Research Center (AHPCRC)
rbw at ahpcrc.org  |  612.337.3467
