[Beowulf] any gp-gpu clusters?
diep at xs4all.nl
Sat Jun 23 08:57:16 PDT 2007
Well, I've spent the past few weeks investigating cards, and it seems
that so far the marketing department is far ahead of actual performance.
On the 8800, the fastest FFT that I could find claims 100 gflop
out of a very expensive card
that on paper should deliver nearly half a teraflop.
That is quite disappointing.
And we haven't even investigated that FFT yet; it seems to do something
that most of us don't need at all.
What we all need is far more complicated to get working really well on those
We also haven't even discussed how to do big matrix calculations, given the
complexity of implementing this into the
You mention a thought that many have had already, namely that if you build a
cluster, you can quite easily upgrade the cards in each node
within a year or 2.
Though this sounds interesting, right now a single card isn't delivering
more than what a quadcore can deliver you,
whereas this quadcore can do much more and can use more RAM.
When the power6 system was presented in Amsterdam a week ago (40 Tflop in
2008, right now it's power5 and 14 Tflop),
I can still remember how happy one scientist was with the 64GB of RAM
that each node has, as RAM speeds his calculations up more than additionally
So he for sure won't line up to calculate on videocards with limited
If you plan to put a card or 4 into a single node, please realize that a
single quadcore node eats about 172 watt (with neither videocard nor i/o in
use) or 180 watt when using a videocard, with all 4 cores at full usage.
Meanwhile a single videocard has a TDP of far over 200 watt, so at
If you plan to put in a videocard or 4 @ 225 watt each, you get a
monster of an energy bill in return.
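
To put rough numbers on that energy bill, here is a minimal sketch. It uses
the 172 watt and 225 watt figures quoted above. The assumptions that are NOT
from this post: each card's 225 watt is simply added on top of the 172 watt
base node, the node runs at full load all year, and electricity costs a
hypothetical 0.20 per kWh. Cooling and power supply losses are ignored.

```python
# Figures quoted in the post.
BASE_NODE_WATT = 172.0  # quadcore node, all 4 cores busy, no videocard or i/o
CARD_WATT = 225.0       # per compute videocard

# Assumptions for illustration only (not from the post).
HOURS_PER_YEAR = 24 * 365
PRICE_PER_KWH = 0.20    # hypothetical electricity price per kWh


def node_power_watt(n_cards: int) -> float:
    """Draw of one node with n_cards cards, assuming card power simply adds."""
    return BASE_NODE_WATT + n_cards * CARD_WATT


def yearly_cost(n_cards: int, price_per_kwh: float = PRICE_PER_KWH) -> float:
    """Electricity cost for one year at constant full load."""
    kwh = node_power_watt(n_cards) * HOURS_PER_YEAR / 1000.0
    return kwh * price_per_kwh


if __name__ == "__main__":
    for n in (0, 1, 4):
        print(f"{n} cards: {node_power_watt(n):.0f} W, "
              f"{yearly_cost(n):.0f} per year")
```

Under these assumptions a node with 4 cards draws a bit over 6 times what the
bare quadcore node draws, so the cards dominate the bill.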
The easiest programming language (CUDA) also seems to deliver the least
performance,
compared with ATI's 2900 card.
The advantages of using a bunch of videocards in a single node are basically:
a) the speculation that the next generation videocards from ATI and NVIDIA
will deliver great performance for those
who can use the card
b) the theoretical possibility to save on network costs, as the network is
basically a pci-e 16x slot on the mainboard.
So where one card is perhaps nearly equal to a quadcore, just on paper, for
something that needs very little RAM, it is obvious that if you put in 4 of
those cards you still need just 1 network card in the node to connect
c) on paper it would be possible for nodes equipped with 2 videocards, 1
simple card to address the system and 1 card to do calculations on, to be
used by 2 users at the same time. On paper, one person could use the
videocard and the other one the rest of the node. This is however wishful
thinking as of now. Which university is going to put in a monster that eats
200 watt or so at full performance and that just 1 or 2 users can use?
There are, however, a few weaknesses that remain:
a) you need n+1 cards in a system to use n cards for calculations
b) The measured latency (practical latency as measured here, not
theoretical) between RAM and the card's
RAM is far worse than what network cards deliver: roughly 50 us for the
8800 versus roughly 1.5 us one-way ping-pong latency for network cards.
The bandwidth is not better either, and with several cards per node that'll
c) the limited amount of RAM on-card and the huge price for cards that do
have more than half a gigabyte of DDR3;
nvidia's high clocked cards really are quite expensive.
d) the huge mass production that ATI and NVIDIA must achieve in order to
sell those cards at a somewhat affordable price, instead of thousands a
card, works against us. For graphics alone all they need is
single precision floating point, whereas the number of people (that's the
people on this beowulf list) who want a card that is programmable like a cpu
and use it for DSP type workloads is quite limited. They need to produce and
sell tens of millions of those cards, so selling a couple of thousand for
calculation type workloads is not really interesting to ati/nvidia, and it is
rather wishful thinking that cards will get really optimized for what we
really need.
e) it is very hard to get information about the cards, like how the caches
work; it's not even clear how BIG the caches are on a card or what the
bottlenecks are on the cards. So programming for those cards in the manner
that HPC needs, namely getting the utmost performance out of them, is totally
impossible to do with some generic programming language. It requires complete
fulltime dedication, having friends at nvidia or ati to get more info, and so
on. It is very specialized work, in short.
This is currently by far the biggest obstacle to start programming for those
f) the few attempts that have been made so far had very disappointing
results, for whatever reason. The lack of information basically means that
the huge marketing balloons of ATI and NVIDIA, promising nearly half a
teraflop per card now, are not even close to reality. Every project on it
so far has failed to deliver more performance than existing generic code
already delivers on c2q.
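
To make weakness b) concrete, here is a rough sketch of what the measured
latencies cost. The 50 us and 1.5 us figures and the 100 gflop FFT figure are
taken from this post. The model itself is an assumption for illustration:
small transfers that are purely latency bound and not overlapped with
computation.

```python
# Figures quoted in the post.
GPU_LATENCY_S = 50e-6   # host RAM <-> 8800, measured
NIC_LATENCY_S = 1.5e-6  # network card, one-way ping-pong
GPU_FLOPS = 100e9       # the 100 gflop FFT figure


def max_transfers_per_second(latency_s: float) -> float:
    """Upper bound on serialized small transfers if each costs one latency."""
    return 1.0 / latency_s


def flops_lost_per_transfer(latency_s: float, flops: float = GPU_FLOPS) -> float:
    """Work the card could have done while waiting out one latency."""
    return latency_s * flops


if __name__ == "__main__":
    ratio = GPU_LATENCY_S / NIC_LATENCY_S
    print(f"card latency is {ratio:.1f}x the network card latency")
    print(f"max small transfers/s to card: "
          f"{max_transfers_per_second(GPU_LATENCY_S):.0f}")
    print(f"flops idle per transfer: "
          f"{flops_lost_per_transfer(GPU_LATENCY_S):.0f}")
```

In this simple model the card has to get several million floating point
operations of work per transfer before the latency stops dominating.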
That said, on paper there is a theoretical possibility that such cards in
future (perhaps end 2007) get huge Teraflop capabilities in single precision,
which cpu's won't have any time soon, so keeping an eye on them is very
interesting. As of now the graphics cards are simply our only hope to get
great gflop capabilities for a small price.
Not many of us will want to give up that dream.
Yet so far it is a mystery how to beat a 3Ghz core2 @ 16 cores dual Xeon
node with a big L2/L3 using a graphics card that has such tiny caches
and is lobotomized everywhere, so that the total number of instructions it
can process on paper simply can never be reached in practice.
To stay objective: ATI's latest 2900 card has 64 streaming processors, which
ATI markets as 320 by the way, lying by a straight factor of 5, and is
clocked at just 742Mhz. So you start at a clock disadvantage against core2
of a factor: 2.4Ghz / 0.742Ghz = 3.2
So you must somewhere win a factor 3.2 to just *keep the same speed* for
And this while on 22 July the 2.4Ghz quadcore drops to 266 dollar, whereas
the ati2900 is currently priced at nearly 400 euro here.
It is very hard to compete when you already must make up for a factor 3+ to
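
The factors above are easy to check. This sketch uses only the numbers quoted
in this post (64 versus 320 streaming processors as described here, 742Mhz
versus 2.4Ghz, 266 dollar versus roughly 400 euro). The exchange rate is NOT
from the post; `eur_to_usd` is a hypothetical parameter you have to supply
yourself.

```python
# Figures quoted in the post.
CORE2_GHZ = 2.4
ATI2900_GHZ = 0.742
MARKETED_SP = 320         # streaming processors as marketed, per the post
ACTUAL_SP = 64            # streaming processors as counted in the post
QUADCORE_PRICE_USD = 266.0
ATI2900_PRICE_EUR = 400.0  # "nearly 400 euro"

clock_factor = CORE2_GHZ / ATI2900_GHZ      # clock disadvantage of the card
marketing_factor = MARKETED_SP / ACTUAL_SP  # marketed vs counted processors


def price_ratio(eur_to_usd: float) -> float:
    """Card price divided by quadcore price, for a caller-supplied rate."""
    return ATI2900_PRICE_EUR * eur_to_usd / QUADCORE_PRICE_USD


if __name__ == "__main__":
    print(f"clock factor: {clock_factor:.1f}")
    print(f"marketing factor: {marketing_factor:.0f}")
    # 1.0 is only a placeholder rate, not a real quote.
    print(f"price ratio at parity: {price_ratio(1.0):.2f}")
```

Even at a placeholder rate of 1.0 the card costs about one and a half times
what the quadcore does, on top of the clock factor.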
That 4.7Ghz power6 is far more interesting in that sense, yet I know in
advance I won't get any system time on it,
whereas I CAN buy a videocard for a couple of hundred euros.
So the future will tell whether graphics chips can
kick butt for a small price; I sure hope so.
----- Original Message -----
From: "Mark Hahn" <hahn at mcmaster.ca>
To: "Beowulf Mailing List" <Beowulf at beowulf.org>
Sent: Thursday, June 21, 2007 4:57 PM
Subject: [Beowulf] any gp-gpu clusters?
> Hi all,
> is anyone messing with GPU-oriented clusters yet?
> I'm working on a pilot which I hope will be something like 8x
> workstations, each with 2x recent-gen gpu cards.
> the goal would be to host cuda/rapidmind/ctm-type gp-gpu development.
> part of the motive here is just to create a gpu-friendly infrastructure
> into which commodity cards can be added and refreshed every 8-12 months.
> as opposed to "investing" in quadro-level cards which are too expensive
> to toss when obsoleted.
> nvidia's 1U tesla (with two g80 chips) looks potentially attractive,
> though I'm guessing it'll be premium/quadro-priced - not really in keeping
> with the hyper-moore's-law mantra...
> if anyone has experience with clustered gp-gpu stuff, I'm interested in
> comments on particular tools, experiences, configuration of the host
> machines and networks, etc. for instance, is it naive to think that
> gp-gpu is most suited to flops-heavy-IO-light apps, and therefore doesn't
> necessarily need a hefty (IB, 10Geth) network?
> thanks, mark hahn.