[Beowulf] coprocessor to do "physics calculations"
Joe Landman
landman at scalableinformatics.com
Sun May 14 13:28:10 PDT 2006
Mark Hahn wrote:
>> Didn't see anyone post this link regarding the Ageia PhysX processor. It is the most comprehensive write-up I have seen.
>>
>> http://www.blachford.info/computer/articles/PhysX1.html
>
> yes, and even so it's not very helpful. "fabric connecting compute and
> memory elements" pretty well covers it! the block diagram they give
> could almost apply directly to Cell, for instance.
>
> fundamentally, about these cell/aegia/gpu/fpga approaches,
> you have to ask:
>
> - how cheap will it be in final, off-the-shelf systems? GPUs
> are most attractive this way, since absurd gaming cards have
> become a check-off even on corporate PCs (and thus high volume.)
> it's unclear to me whether Cell will go into any million-unit
> products other than dedicated game consoles.
This will drive prices for the Cell way down. Volume has a habit of
helping do that. FPGAs will likely remain several thousand dollars per
unit (Virtex 4 and above) unless you can drive many units, in which case
you have to start looking at the economics of ASICs if your algorithm
never changes. If you have frequently changing algorithms, or want to
build a special processor per code, then you need the programmability of
the FPGA. For this to make sense from a price point of view, you have
to look at the overall performance you get out of it. Few people (I
think) would be willing to pay $10k USD for a 10x performance delta,
though I would think that at closer to a 100x delta, this price
wouldn't be an issue.
> - does it run efficiently-enough? most sci/eng I see is pretty
> firmly based on 64b FP, often with large data. but afaict,
Numerical stuff is pretty much DP FP right now. I saw one of the FPGA
GRAPE units running a stellar dynamics simulator at SC05. If you are
willing to give up IEEE 754/854 for performance, you can do some pretty
amazing things.
> Cell (eg) doesn't do well on anything but in-cache 32b FP.
The idea with Cell, and pretty much all APUs (acceleration processing
units) out there today, is that you need to double buffer and constantly
stream data in. This limits which algorithms they can work on, though
not terribly so.
> GPUs have tantalizingly high local-mem bandwidth, but also
> don't really do anything higher than 32b.
Single precision isn't so bad for many calculations. You would be
surprised how many of the auto companies run long crash simulations this
way. There are considerations other than the accuracy of the base data
type that can swamp the calculation.
> - how much time will it take to adapt to the peculiar programming
> model necessary for the device? during the time spent on that,
> what will happen to the general-purpose CPU market?
Yes. This is why any APU must be easy to program. Non-programmable
APUs or minimally programmable units (fixed-function units) are doomed
to niches at best. You need to be able to turn your codes around on it
very quickly, on a timescale of days, not the months that Verilog/VHDL
development takes.
> I think price, performance and time-to-market are all stacked against this
> approach, at least for academic/research HPC. it would be different if the
I disagree. On specific codes (possibly not FP-heavy ones right now, if
we are talking about FPGAs), both the price/performance and the raw
performance will be difficult to beat. The time to market is critical.
Part of this is accelerator card design. Part of it is the ease of
spinning new applications. Application turnaround time cannot exceed
something close to a month, or no one will do it.
For various informatics codes, you can get performance deltas of
100-300x (I have seen 300x reported in papers, others have reported
higher). If you can get 100x better performance by adding a $10k USD
board, would you do it?
For chemistry codes and other FP-heavy codes, you need a DP (64b)
accelerator. FPGAs don't make good DP FP units right now; IEEE 754 is
expensive in terms of gates, so you can't fit enough units on the chip.
The best I have heard of is the SRC MAP processor, which had something
like 100 units running at 150 MHz and could just eke out 11 GFlops. As
this is comparable to a dual-core Opteron, this is not the way you want
to go for double-precision floating point. There are other options (now
and coming online).
> general-purpose CPU market stood still, or if there were no way to scale up
> existing clusters...
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 734 786 8452
cell : +1 734 612 4615