[Beowulf] coprocessor to do "physics calculations"

Mon May 15 21:35:37 PDT 2006

(apologies for going a bit further off topic)

Ageia is chasing the broad market, so the pricing is $275 for the Ageia 
card.[1] It certainly spreads the spectrum of price/performance for tuned 
code. It doesn't do much for time to solution. The programming model is 
clearly the single biggest obstacle to success. The GPU market benefited 
from standardized APIs for video display like DirectX & OpenGL. Ageia 
has their own game engines, so they explicitly control the programming 
model.[2] There are rumors of a MSFT DirectPhysics API to bring together 
the underlying APIs for game physics to be supported on GPUs, PPUs & 
CPUs.[3] A great idea, but daunting to execute. GPUs, Ageia, Cell & 
GRAPE have a luxury: developing the hardware & software for a specific 
application area large enough to amortize development costs. Systems 
mixing FPGAs and CPUs are broadly used in high-end imaging solutions. 
These look on paper to be a programming nightmare. Perhaps the obstacle 
is not programmability, but defining the right application.[4] In the 
age of (relatively) inexpensive FPGA gates and ASICs why aren't 
communities of users seeking hardware & software partners to speed up 
the critical loops? With increased bandwidth, lower latencies and 
standard interfaces available on commodity platforms, you would only pay 
a premium for the part of the system that delivers the performance. 
Wasn't this some of the same thinking that moved codes from SMPs to 
Beowulf clusters in the first place? That was a modestly difficult 
programming task that took advantage of emerging hardware performance.[5] However, I 
recognize the uphill battle.[6] Math & solver libraries can deliver 
improved performance on a hardware platform with minimal changes to 
code. Despite the ease & possible performance gains, I know of only a 
handful of commercial codes that make use of vendor-supplied libraries. 
It is expensive to qualify multiple environments. Once you give the 
end-users some latitude on libraries, who knows what else they may 
develop. They may plug in an APU that improves performance 20x, lowers 
licensing revenue and introduces a different rounding scheme than the 
original binary - accurate, yet different - and call you for support.
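
The "accurate, yet different" case can be shown with a minimal sketch. Floating-point addition is not associative, so a library or accelerator that reorders a reduction can return a slightly different, equally defensible answer. The function name and sample values here are illustrative assumptions, not taken from any vendor library.

```python
def running_sum(values):
    """Add values strictly left to right, as a simple serial loop would.

    An explicit loop is used instead of the built-in sum(), which may apply
    compensated summation to floats in newer Python versions.
    """
    total = 0.0
    for v in values:
        total += v
    return total


data = [0.1, 0.2, 0.3]

serial = running_sum(data)            # (0.1 + 0.2) + 0.3
reordered = running_sum(data[::-1])   # (0.3 + 0.2) + 0.1

# Both results are within one rounding error of 0.6, yet they are not equal.
print(serial, reordered, serial == reordered)
```

A regression suite that compares output bit-for-bit against the original binary would flag the second result as a failure, which is one reason qualifying multiple library environments gets expensive.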

[1] Based upon Alienware's online configurations
[2] They run at lower resolution on CPUs and with higher realism when the Ageia processor is present.
[3] http://digg.com/software/Microsoft_making_their_own_physics_SDK_API
[4] I nominate weather codes
[5] If I can divide my code into intelligent work units, then I don't need to run them in a single shared memory machine. If I can divide my code into intelligent work units, then I don't need to run them on the same type of processor. 
[6] And let's not start on the complexity of managing a cluster of truly heterogeneous nodes... I'm sure Don Becker is already working on this one ;) 
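
As a rough sketch of the idea in [5]: once the problem is cut into independent work units, the dispatcher does not need to care which kind of processor handles each one. Everything below is hypothetical; `accelerator_kernel` is only a stand-in emulated on the CPU, and round-robin assignment is just the simplest possible policy.

```python
from concurrent.futures import ThreadPoolExecutor


def cpu_kernel(unit):
    """Plain-CPU version of the critical loop (here: a sum of squares)."""
    return sum(x * x for x in unit)


def accelerator_kernel(unit):
    """Stand-in for an offloaded version; hypothetical, emulated on the CPU."""
    return sum(x * x for x in unit)


def split(data, unit_size):
    """Cut the problem into independent work units."""
    return [data[i:i + unit_size] for i in range(0, len(data), unit_size)]


def run(data, backends, unit_size=4):
    """Hand each work unit to a backend and combine the partial results."""
    units = split(data, unit_size)
    with ThreadPoolExecutor(max_workers=len(backends)) as pool:
        # Round-robin: unit i goes to backend i mod len(backends).
        futures = [pool.submit(backends[i % len(backends)], unit)
                   for i, unit in enumerate(units)]
        return sum(f.result() for f in futures)


data = list(range(10))
total = run(data, [cpu_kernel, accelerator_kernel])
print(total)
```

The sketch uses integers so the combined result does not depend on which backend ran which unit; with floating-point data the rounding caveat from the main text applies.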

Date: Sun, 14 May 2006 14:37:22 -0400 (EDT)
From: Mark Hahn <hahn at physics.mcmaster.ca>
Subject: Re: [Beowulf] coprocessor to do "physics calculations"
To: beowulf at beowulf.org
Message-ID: <Pine.LNX.4.44.0605141251570.7486-100000 at coffee.psychology.mcmaster.ca>
Content-Type: TEXT/PLAIN; charset=US-ASCII

> > Didn't see anyone post this link regarding the Ageia PhysX processor. It is the most comprehensive write-up I have seen.
> > 
> > http://www.blachford.info/computer/articles/PhysX1.html
>   

yes, and even so it's not very helpful.  "fabric connecting compute and
memory elements" pretty well covers it!  the block diagram they give
could almost apply directly to Cell, for instance.

fundamentally, about these cell/ageia/gpu/fpga approaches,
you have to ask:

	- how cheap will it be in final, off-the-shelf systems?  GPUs
	are most attractive this way, since absurd gaming cards have 
	become a check-off even on corporate PCs (and thus high volume.)
	it's unclear to me whether Cell will go into any million-unit 
	products other than dedicated game consoles.

	- does it run efficiently enough?  most sci/eng I see is pretty
	firmly based on 64b FP, often with large data.  but afaict, 
	Cell (eg) doesn't do well on anything but in-cache 32b FP.
	GPUs have tantalizingly high local-mem bandwidth, but also 
	don't really do anything higher than 32b.

	- how much time will it take to adapt to the peculiar programming
	model necessary for the device?  during the time spent on that,
	what will happen to the general-purpose CPU market?
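
to make the 32b-vs-64b concern concrete, here is a small sketch that accumulates the same value in both precisions. the 32-bit path is emulated by rounding through `struct` after every add; the helper names and the choice of 0.1 and 100,000 steps are assumptions for illustration, not a benchmark of any real device.

```python
import struct


def to_f32(x):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step, n, single_precision):
    """Add `step` to a running total n times in the chosen precision."""
    total = 0.0
    if single_precision:
        step = to_f32(step)
    for _ in range(n):
        total += step
        if single_precision:
            total = to_f32(total)  # emulate keeping the sum in 32 bits
    return total


n = 100_000
exact = n / 10  # 10000.0

err64 = abs(accumulate(0.1, n, False) - exact)
err32 = abs(accumulate(0.1, n, True) - exact)

# The 32-bit error is orders of magnitude larger than the 64-bit error.
print(err64, err32)
```

a long time-stepping loop does this kind of accumulation constantly, which is why 32b-only hardware is a hard sell for codes written around 64b FP.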

I think price, performance and time-to-market are all stacked against this 
approach, at least for academic/research HPC.  it would be different if the
general-purpose CPU market stood still, or if there were no way to scale up
existing clusters...
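
the time-to-market argument can be put as back-of-the-envelope arithmetic: if a port takes a while and general-purpose CPUs keep improving in the meantime, the advantage left at ship time is the raw speedup divided by the CPU gain over that period. the 18-month doubling period is an assumed figure, and the 20x speedup is borrowed from the "improves performance 20x" example earlier in the thread; neither is a measurement.

```python
def net_speedup(raw_speedup, porting_months, cpu_doubling_months):
    """Advantage remaining once the port ships, versus CPUs available by then.

    Assumes CPU performance doubles every `cpu_doubling_months`
    (an assumption made only for this sketch).
    """
    cpu_gain = 2.0 ** (porting_months / cpu_doubling_months)
    return raw_speedup / cpu_gain


# How a hypothetical 20x device advantage erodes with porting time.
for months in (6, 12, 24, 36):
    print(months, round(net_speedup(20.0, months, 18.0), 2))
```

under these assumptions a 20x device that takes 18 months to program for is worth 10x on delivery, before counting the cost of the port itself.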
