[Beowulf] coprocessor to do "physics calculations"
douglas at shore.net
Mon May 15 21:35:37 PDT 2006
(apologies for going a bit further off topic)
Aegia is chasing the broad market, so the pricing is $275 for the Aegia
card. It certainly spreads the spectrum of price/performance for tuned
code. It doesn't do much for time to solution. The programming model is
clearly the single biggest obstacle to success. The GPU market benefited
from standardized APIs for video display like DirectX & OpenGL. Aegia
has their own games engines so they explicitly control the programming
model. There are rumors of a MSFT DirectPhysics API to bring together
the underlying APIs for game physics to be supported on GPUs, PPUs &
CPUs. A great idea, but daunting to execute. GPUs, Aegia, Cell &
GRAPE have a luxury: each develops hardware & software for a specific
application area large enough to amortize development costs. Systems
mixing FPGAs and CPUs are broadly used in high-end imaging solutions.
These look on paper to be a programming nightmare. Perhaps the obstacle
is not programmability but defining the right application. In the
age of (relatively) inexpensive FPGA gates and ASICs why aren't
communities of users seeking hardware & software partners to speed up
the critical loops? With increased bandwidth, lower latencies and
standard interfaces available on commodity platforms, you would only pay
a premium for the part of the system that delivers the performance.
Wasn't this some of the same thinking that moved codes from SMPs to
beowulf clusters in the first place? It was a modestly difficult programming
task that took advantage of emerging hardware performance. However, I
recognize the uphill battle. Math & solver libraries can deliver
improved performance on a hardware platform with minimal changes to
code. Despite the ease & possible performance gains, I only know a
handful of commercial codes that make use of vendor-supplied libraries.
It is expensive to qualify multiple environments. Once you give the
end-users some latitude on libraries, who knows what else they may
develop. They may plug in an APU that improves performance 20x, lowers
licensing revenue and introduces a different rounding scheme than the
original binary - accurate, yet different - and call you for support.
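
To make the "accurate, yet different" point concrete, here is a minimal
sketch in Python. It is only an illustration: the function names are
hypothetical and it does not model any particular accelerator. It assumes
nothing more than that a coprocessor or tuned library may add the same
numbers in a different order (for example, as a tree reduction) than the
original serial loop did.

```python
import math

def sum_left_to_right(values):
    """Serial accumulation, as a simple scalar loop would do it."""
    total = 0.0
    for v in values:
        total += v
    return total

def sum_pairwise(values):
    """Tree-style reduction, the order a parallel unit might use (assumed)."""
    if not values:
        return 0.0
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return sum_pairwise(values[:mid]) + sum_pairwise(values[mid:])

values = [0.1] * 10

serial = sum_left_to_right(values)
tree = sum_pairwise(values)
reference = math.fsum(values)  # correctly rounded sum, for comparison

print("serial:   ", repr(serial))
print("pairwise: ", repr(tree))
print("reference:", repr(reference))
print("bitwise identical:", serial == tree)
```

Both results are valid floating-point answers that agree to many digits,
but they are not bitwise identical, and that is exactly the kind of
difference that generates a support call.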
 Based upon Alienware's online configurations
 They run in lower resolution on CPUs and high realism when the Aegia processor is present.
 I nominate weather codes
 If I can divide my code into intelligent work units, then I don't need to run them in a single shared memory machine. If I can divide my code into intelligent work units, then I don't need to run them on the same type of processor.
 And let's not start on the complexity of managing a cluster of truly heterogeneous nodes... I'm sure Don Becker is already working on this one ;)
Date: Sun, 14 May 2006 14:37:22 -0400 (EDT)
From: Mark Hahn <hahn at physics.mcmaster.ca>
Subject: Re: [Beowulf] coprocessor to do "physics calculations"
To: beowulf at beowulf.org
Message-ID: <Pine.LNX.4.44.0605141251570.7486-100000 at coffee.psychology.mcmaster.ca>
Content-Type: TEXT/PLAIN; charset=US-ASCII
> > Didn't see anyone post this link regarding Aegia Physix processor. It is the most comprehensive write up I have seen.
> > http://www.blachford.info/computer/articles/PhysX1.html
yes, and even so it's not very helpful. "fabric connecting compute and
memory elements" pretty well covers it! the block diagram they give
could almost apply directly to Cell, for instance.
fundamentally, about these cell/aegia/gpu/fpga approaches,
you have to ask:
- how cheap will it be in final, off-the-shelf systems? GPUs
are most attractive this way, since absurd gaming cards have
become a check-off even on corporate PCs (and thus high volume.)
it's unclear to me whether Cell will go into any million-unit
products other than dedicated game consoles.
- does it run efficiently-enough? most sci/eng I see is pretty
firmly based on 64b FP, often with large data. but afaikt,
Cell (eg) doesn't do well on anything but in-cache 32b FP.
GPUs have tantalizingly high local-mem bandwidth, but also
don't really do anything higher than 32b.
- how much time will it take to adapt to the peculiar programming
model necessary for the device? during the time spent on that,
what will happen to the general-purpose CPU market?
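
As a rough illustration of the 32b vs 64b FP concern in the list above,
here is a small Python sketch. It is an assumption-laden toy, not a model
of Cell or any GPU: single precision is emulated by rounding through the
standard `struct` module after every addition, and the helper names
(`to_float32`, `accumulate`) are made up for this example.

```python
import struct

def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(step, count, single_precision):
    """Add `step` to a running total `count` times.

    When single_precision is True, the step and every partial sum are
    rounded to 32 bits, which emulates a 32b accumulator.
    """
    if single_precision:
        step = to_float32(step)
    total = 0.0
    for _ in range(count):
        total += step
        if single_precision:
            total = to_float32(total)
    return total

COUNT = 100_000
STEP = 0.1
expected = COUNT / 10  # 100,000 steps of 0.1

sum64 = accumulate(STEP, COUNT, single_precision=False)
sum32 = accumulate(STEP, COUNT, single_precision=True)

err64 = abs(sum64 - expected)
err32 = abs(sum32 - expected)

print("64b sum:", repr(sum64), "abs error:", err64)
print("32b sum:", repr(sum32), "abs error:", err32)
```

A long accumulation like this is a worst case chosen to make the gap
visible; real codes can often reorder or compensate, but that is part of
the adaptation cost raised in the list.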
I think price, performance and time-to-market are all stacked against this
approach, at least for academic/research HPC. it would be different if the
general-purpose CPU market stood still, or if there were no way to scale up