[Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf cluster.

Thu Mar 10 11:48:19 PST 2005

At 09:19 AM 3/10/2005, Robert G. Brown wrote:
>On Thu, 10 Mar 2005, Joe Landman wrote:
>
> >
> > Part of what motivates this question are things like the Cray XD1 FPGA
> > board, or PathScale's processors (unless I misunderstood their
> > functions).  Other folks have CPUs on a card of various sorts, ranging
> > from FPGA to DSPs.   I am basically wondering aloud what sort of demand
> > for such technology might exist.  I assume the answer starts with "if
> > the price is right" ...  the question is what is that price, what are
> > the features/functionality, and how hard do people want to work on such
> > bits.
>
>Problems with coprocessing solutions include:
>
>   a) Cost -- sometimes they are expensive, although they >>can<< yield
>commensurate benefits for some code as you point out.
>
>   b) Availability -- I don't just mean whether or not vendors can get
>them; I mean COTS vs non-COTS.  They are frequently one-of-a-kind beasts
>with a single manufacturer.

Definitely an issue.

>   c) Usability.  They typically require "special tools" to use them at
>all.  Cross-compilers, special libraries, code instrumentation.  All of
>these things require fairly major programming effort to implement in
>your code to realize the speedup, and tend to decrease the
>general-purpose portability of the result, tying you even more tightly
>(after investing all this effort) with the (probably one) manufacturer
>of the add-on.

To a certain extent, though, this is being mitigated by things like Signal 
Processing Workbench or Matlab, which have "plug-ins" to convert generic 
algorithm descriptions (e.g., Simulink models) into runnable code on 
the coprocessor or FPGA.

As far as product lock-in goes, "in theory" one could just recompile for a 
new target processor, although I don't know if anyone's ever done this.

Such tools do greatly reduce the "time and cost to demonstrate capability".

>   d) Continued Availability -- They also not infrequently disappear
>without a trace (as "general purpose" coprocessors, not necessarily as
>ASICs) within a year or so of being released and marketed.  This is
>because Moore's Law is brutal, and even if a co-processor DOES manage to
>speed up your actual application (and not just a core loop that
>comprises 70% of your actual application) by a factor of ten, that's at
>most four or five years of ML advances.  If your code has a base of 30%
>or so that isn't sped up at all (fairly likely) then your application
>runs maybe 2-3 times as fast at best and ML eats it in 1-3 years.
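
The arithmetic in the quoted paragraph is Amdahl's law plus a Moore's Law 
doubling period.  A minimal sketch, assuming an 18-month doubling time (the 
function names are made up for illustration, and the doubling time is an 
assumption, not a measured figure):

```python
import math

def amdahl_speedup(accelerated_fraction, factor):
    """Overall speedup when only part of the run time is accelerated."""
    untouched = 1.0 - accelerated_fraction
    return 1.0 / (untouched + accelerated_fraction / factor)

def years_until_moore_catches_up(speedup, doubling_years=1.5):
    """Years of CPU doubling needed to match a given speedup.

    doubling_years=1.5 is an assumed 18-month doubling period.
    """
    return doubling_years * math.log2(speedup)

# Case 1: the coprocessor speeds up the whole application by 10x.
whole_app = amdahl_speedup(1.0, 10.0)

# Case 2: only a core loop that is 70% of the run time gets the 10x.
core_loop = amdahl_speedup(0.7, 10.0)

for label, s in (("whole application", whole_app), ("70% core loop", core_loop)):
    years = years_until_moore_catches_up(s)
    print(f"{label}: {s:.2f}x overall, matched by plain CPUs in about {years:.1f} years")
```

With those assumptions the 70% case comes out at roughly 2.7x overall, which 
is where the "2-3 times as fast at best" figure comes from.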

There are specialized applications, lending themselves to clusters, for 
which this might not hold. If we look at Xilinx FPGAs, for instance, while 
not quite doubling every 18 months, they ARE dramatically increasing in 
speed and size fairly quickly.  And, it's not hugely difficult to take a 
design that ran at speed X on size Y Xilinx FPGA and port it to speed A on 
Size B Xilinx FPGA.

Consider a classic big crunching ASIC/FPGA application, that of running 
many correlators in parallel to demodulate very faint signals buried in 
noise (specifically, raw data coming back from deep space probes), or some 
applications in radio astronomy.  In the latter case, particularly, there's 
a lot of interest in taking an array of radio telescopes and simultaneously 
forming many beams, so you can look in lots of directions at once, to look for 
transient events that are "interesting" (like supernovae).  The radio 
astronomy community is relatively poor (Paul Allen's interest 
notwithstanding), so they've got an incentive to use cheap commodity 
processing for their needs, but off the shelf PCs might not hack 
it.  They're looking at a lot of architectures that strongly resemble the 
usual cluster... data from all antennas streams into a raft of processors 
via ethernet, and each processor forms some subset of beams either in space 
or frequency. They might have a coprocessor card in the machine that does 
some of the early really intensive beamforming computation.
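
To give a feel for why a bank of correlators is such a natural fit for 
ASIC/FPGA hardware, here is a toy sketch in plain Python.  It is not how any 
real deep space receiver or telescope back end is built; the reference 
sequence, amplitude, offsets and function names are all invented for 
illustration.  Each candidate lag is an independent multiply-accumulate, 
which is exactly the kind of work that can run in parallel in hardware:

```python
import random

def correlate_at(samples, code, offset):
    """One correlator: multiply-accumulate the reference code at a single lag."""
    return sum(samples[offset + i] * c for i, c in enumerate(code))

def find_signal(samples, code, n_lags):
    """Run one correlator per candidate lag; return the lag with the largest output."""
    scores = [correlate_at(samples, code, k) for k in range(n_lags)]
    best = max(range(n_lags), key=lambda k: scores[k])
    return best, scores

rng = random.Random(2005)  # fixed seed so the run is repeatable

# Hypothetical +/-1 reference sequence known to the receiver.
code = [rng.choice((-1.0, 1.0)) for _ in range(2048)]

true_offset = 137   # where the signal really starts (unknown to the receiver)
amplitude = 0.5     # signal amplitude is half the noise RMS, so it is buried
n_lags = 300        # number of lags searched, i.e. number of parallel correlators

# Unit-variance Gaussian noise with the weak signal added at true_offset.
samples = [rng.gauss(0.0, 1.0) for _ in range(n_lags + len(code))]
for i, c in enumerate(code):
    samples[true_offset + i] += amplitude * c

best, scores = find_signal(samples, code, n_lags)
print("detected lag:", best, "true lag:", true_offset)
```

The inner loop is nothing but multiplies and adds over a fixed-length window, 
and the 300 lags never talk to each other, so the structure maps onto 
hardware (or onto separate cluster nodes) with essentially no coordination.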

Take a look at the Allen Telescope Array or at the Square Kilometer Array 
or at LOFAR.

>Anecdotally I'm reminded of e.g. the 8087, Micro Way's old transputer
>sets (advertised in PC mag for decades), the i860 (IIRC), the CM-5, and
>many other systems built over the years that tried to provide e.g. a
>vector co-processor in parallel with a regular general purpose CPU,
>sometimes on the same motherboard and bus, sometimes on daughterboards
>or even on little mini-network connections hung off the bus somehow.
>
>None of these really caught on (except for the 8087, and it is an
>exercise for the studio audience as to why an add-on processor that
>really should have been a part of the original processor itself, made by
>the mfr of the actual crippled CPU from the beginning, succeeded),

That's pretty easy.  In the good old days, you had an integer CPU and an 
add-on FPU in almost all architectures. The FPU didn't have instruction 
decoding, sequencing, or anything like that; it was more like an extra ALU that 
tied to the internal bus.  Just like having memory management in a separate 
chip.  Intel and Motorola both used this approach. Intel did start to 
integrate the MMU into the chip with "segment registers" on the 8086, 
except that it provided zip, zero, none, nada memory protection.  This was 
part of a strategy to keep the codebase compatible with the 8080. After 
all, who in their right mind would write a program bigger than 64K?  The 
user application code would never look at the segment registers, which 
would be managed by a multitasking OS.  Think of it as integrated "bank 
switching", which was quite popular in the 8-bit processor world (and 
itself an outgrowth of how PDP-11 memory management worked).
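
For anyone who hasn't had the pleasure, the 8086 segment arithmetic can be 
sketched in a few lines.  The function name is just for illustration; the 
rule itself (16-bit segment shifted left four bits, plus a 16-bit offset, 
on a 20-bit address bus) is how 8086 real-mode addressing works:

```python
def physical_address(segment, offset):
    """8086 real-mode address from a segment:offset pair.

    Both inputs are treated as 16-bit values; the result wraps at 20 bits
    because the 8086 has only 20 address lines.
    """
    segment &= 0xFFFF
    offset &= 0xFFFF
    return ((segment << 4) + offset) & 0xFFFFF

# A program that never touches the segment register sees a flat 64K window.
window_start = physical_address(0x1234, 0x0000)
window_end = physical_address(0x1234, 0xFFFF)

# Different segment:offset pairs can name the same byte, and nothing stops
# any program from loading any segment value: there is no protection check.
alias_a = physical_address(0x1234, 0x0010)
alias_b = physical_address(0x1235, 0x0000)

print(hex(window_start), hex(window_end), hex(alias_a), hex(alias_b))
```

Changing the segment register just slides the 64K window around in 16-byte 
steps, which is why it behaves like bank switching rather than like a real 
MMU.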

It wasn't until the 80286 that there started to be some more sophistication, 
and really, it was the 386 that made decent memory management possible.

Moto started with a virtual memory scheme and paging, and so became the 
darling of software folks who had come to expect such things from the 
PDP-11, DEC-10, DG, and even mainframe world.

In any case, NONE of them could have fit the FPU on the die and had decent 
yields. Besides, you're talking processors that cost $200-400 (in the 1980s) 
and processors with integrated FPUs would have cost upwards of $1K-$1.5K 
(because of the lower yield). As fab technology advanced, you could either 
build bigger faster processors (in the separate CPU/FPU model) or you could 
build integrated processors at the same slow speed.
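
A back-of-envelope way to see the yield argument is the simple Poisson 
defect model, where yield falls off exponentially with die area.  This is 
only a sketch: the defect density, the die area, and the guess that adding 
an FPU doubles the die are all hypothetical numbers picked for illustration, 
not data from any actual fab or part.

```python
import math

def poisson_yield(die_area_cm2, defects_per_cm2):
    """Fraction of good dies under a simple Poisson defect model."""
    return math.exp(-die_area_cm2 * defects_per_cm2)

def relative_cost_per_good_die(die_area_cm2, defects_per_cm2):
    """Silicon area consumed per good die, in arbitrary cost units."""
    return die_area_cm2 / poisson_yield(die_area_cm2, defects_per_cm2)

DEFECTS_PER_CM2 = 1.4   # hypothetical defect density
CPU_AREA_CM2 = 0.5      # hypothetical CPU-only die area
CPU_FPU_AREA_CM2 = 1.0  # hypothetical: integrating the FPU doubles the die

cpu_cost = relative_cost_per_good_die(CPU_AREA_CM2, DEFECTS_PER_CM2)
combined_cost = relative_cost_per_good_die(CPU_FPU_AREA_CM2, DEFECTS_PER_CM2)
ratio = combined_cost / cpu_cost

print(f"CPU-only yield:  {poisson_yield(CPU_AREA_CM2, DEFECTS_PER_CM2):.2f}")
print(f"CPU+FPU yield:   {poisson_yield(CPU_FPU_AREA_CM2, DEFECTS_PER_CM2):.2f}")
print(f"cost ratio:      {ratio:.1f}x")
```

Under those made-up numbers, doubling the die area costs about four times as 
much per good die rather than twice as much, because you pay for the extra 
silicon and for the lower yield at the same time.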

Even today, I'd venture to guess that the vast majority of CPU cycles spent 
on PCs are integer mode computations (bitblts and the like to make windows 
work).  It's not like you need FP to do Word or PowerPoint, or even 
Excel.  It's rendered 3D graphics that really drives FP performance in the 
consumer market.

This drives an interesting battle between the graphics ASIC makers (so that 
an add-on card can do the rendering) and the CPU makers (who want to put it 
onboard, so that the total system cost is less), as well as over the support 
provided by MS Windows to use either one effectively.  The game market 
clearly doesn't want to have to try to support ALL the possible graphics 
cards out there.  (It was a nightmare trying to write high-performance 
graphics applications back in the late '80s and early '90s.  The few skilled 
folks who were good at it earned their shekels.)

>although nearly all of them were used by at least a few intrepid
>individuals to great benefit.  Allowing that Nature is efficient in its
>process of natural selection, this seems like a genetic/memetic
>variation that generally lacks the CBA advantages required to make it a
>real success.
>
>    rgb

James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875