[Beowulf] any gp-gpu clusters?

Sat Jun 23 08:57:16 PDT 2007

Hello Mark,

Well, I've spent the past few weeks investigating cards, and it seems that
so far the marketing department is far ahead of actual performance.

The fastest FFT that I could find for the 8800 claims 100 gflop out of a
very expensive card that on paper should deliver nearly half a teraflop.

That is quite disappointing.

And we haven't even investigated that FFT yet, as it seems to do something
that most of us don't need at all; what we all need is far more complicated
to get working really well on those cards.

We also haven't even discussed how to do big matrix calculations, given the
complexity of implementing them on this architecture.

You mention a thought that many have had already, namely if you build a 
cluster, that within a year or 2 you can
quite easily upgrade the cards in each node.

Though this sounds interesting, right now a single card isn't delivering 
more than what a quadcore can deliver you,
whereas this quadcore can do much more and can use more RAM.

When the power6 system was presented in Amsterdam a week ago (40 Tflop in
2008; right now it's power5 and 14 Tflop), I can still remember how one
scientist was very happy with the 64GB of RAM that each node has, as RAM
speeds up his calculations more than additional processing power does.

So he for sure won't line up for calculating within videocards with limited 
RAM.

If you plan to put a card or 4 into a single node, please realize that a 
single quadcore node eats about 172 watt (when not using videocard nor i/o) 
or 180 watt when using a videocard, this with all 4 cores at full usage.

Meanwhile a single videocard has a TDP of far over 200 watt at full usage.

If you plan to put in a videocard or 4 @ 225 watt each, you have some 
monster of an energy bill in return.
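
To put rough numbers on that, here is a minimal Python sketch of the
arithmetic. It assumes the 172 watt node figure and the 225 watt per card
from above, that everything runs at full load around the clock, and it
ignores power supply losses and cooling. The function names are made up
for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: the node runs flat out all year


def node_power_watts(num_cards, node_watts=172, card_watts=225):
    """Total draw of one node: quadcore host plus num_cards videocards."""
    return node_watts + num_cards * card_watts


def energy_kwh_per_year(watts):
    """Convert a constant draw in watts to kWh per year."""
    return watts * HOURS_PER_YEAR / 1000.0


for cards in (0, 1, 4):
    watts = node_power_watts(cards)
    print(f"{cards} card(s): {watts} W, {energy_kwh_per_year(watts):.0f} kWh/year")
```

Under these assumptions a node with 4 cards draws 1072 watt, more than 6
times the 172 watt of the bare quadcore node.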

The easiest programming language (CUDA) also seems to deliver the least
performance, compared with ATI's 2900 card.

The advantages of using a bunch of videocards in a single node are basically
the following:

a) the speculation that the next generation videocards from ATI and NVIDIA 
will deliver great performance for those
who can use the card

b) the theoretical possibility to save on network costs, as the network is
basically a pci-e 16x slot on the mainboard.

So where one card is perhaps nearly equal to a quadcore, just on paper, for 
something that needs very little RAM; it is obvious that if you put in 4 of 
those cards that you still just need 1 network card in the node to connect 
the network.

c) on paper it would be possible for nodes equipped with 2 videocards, 1
simple card to address the system and 1 card to do calculations on, to be
used by 2 users at the same time. On paper, one person could use the
videocard and the other one the rest of the node. This is however wishful
thinking as of now. Which university is going to put in a monster that eats
200 watt or so at full performance and that just 1 or 2 users can use?

There are however a few weaknesses that remain:

a) you need n+1 cards in a system to use n cards for calculations

b) The measured latency (practical latency measured here, not theoretical)
between main RAM and the card's RAM is far worse than what network cards
deliver: roughly 50 us for the 8800 versus roughly 1.5 us one-way ping pong
latency for network cards.

The bandwidth is not better either, and with several cards per node it will
probably deteriorate.

c) the limited amount of RAM on-card and the huge price for cards that do
have more than half a gigabyte of GDDR3; nvidia's high clocked cards really
are quite expensive.

d) the huge mass production that ATI and NVIDIA must achieve to keep those
cards somewhat affordable, instead of thousands a card, works against us.
For just graphics all they need is single precision floating point, whereas
the number of people (that's the people on this beowulf list) who want a
card that is programmable like a cpu, to use it for DSP type workloads, is
quite limited. They need to produce and sell tens of millions of those
cards, so selling a couple of thousand for calculation type workloads is
not really interesting to ATI/NVIDIA, and it is rather wishful thinking
that cards will get really optimized for what we need.

e) it is very hard to get information about the cards, like how the caches
work; it's not even clear how BIG the caches on a card are, or what the
bottlenecks on the cards are. So programming those cards in the manner that
HPC needs, namely getting the utmost performance out of them, is totally
impossible to do with some generic programming language. It requires
complete fulltime dedication, friends at nvidia or ati to get more info,
and so on. In short, it is very specialized work.

This is currently by far the biggest obstacle to start programming for those 
cards.

f) the few attempts that have been made so far had very disappointing
results, for whatever reason. The lack of information basically means that
the huge marketing balloons of ATI and NVIDIA, promising nearly half a
teraflop per card now, are not even close to reality. Every project on it
so far has failed to deliver more performance than existing generic code
already delivers on a c2q.
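
To make the latency gap from weakness b) concrete, here is a small Python
sketch of the arithmetic. It uses only the rough figures quoted there (50 us
for the 8800, 1.5 us for a network card) and assumes every transfer is a
tiny synchronous message, so bandwidth is ignored. The names are
illustrative.

```python
GPU_LATENCY_US = 50.0  # rough measured RAM <-> card RAM latency, 8800
NIC_LATENCY_US = 1.5   # rough one-way network card latency


def latency_ratio(slow_us, fast_us):
    """How many times slower the slow path is per message."""
    return slow_us / fast_us


def max_messages_per_second(latency_us):
    """Upper bound on back-to-back small messages that each wait one latency."""
    return 1_000_000.0 / latency_us


ratio = latency_ratio(GPU_LATENCY_US, NIC_LATENCY_US)
print(f"card path is {ratio:.1f}x slower per message")
print(f"card: {max_messages_per_second(GPU_LATENCY_US):.0f} msgs/s")
print(f"nic:  {max_messages_per_second(NIC_LATENCY_US):.0f} msgs/s")
```

So with these figures the card path costs roughly 33 times more per small
message, which only pays off if each transfer carries a lot of work.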

That said, on paper there is a theoretical possibility that such cards will
in future (perhaps end 2007) get huge single precision Teraflop
capabilities, which cpu's won't have any time soon, so keeping an eye on
them is very interesting. As of now the graphics cards are simply our only
hope to get great gflop capabilities for a small price.

Not many of us will want to give up that dream.

Yet so far it is a mystery how to beat a 3 GHz core2 based dual Xeon node
(16 cores) with big L2/L3 caches using a graphics card that has such tiny
caches and is so lobotomized everywhere that the total number of
instructions it can process on paper can simply never be reached.

To stay objective: ATI's latest 2900 card has 64 streaming processors, which
ATI markets as 320 by the way (overstating it by a factor 5 right away),
and is clocked at just 742 MHz. So you start at a disadvantage against
core2 of a factor: 2.4 GHz / 0.742 GHz = 3.2

So you must somewhere win a factor 3.2 to just *keep the same speed* for 
your code.
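
The handicap above, as a minimal Python sketch. It only restates the numbers
in this mail (2.4 GHz core2, 742 MHz ATI 2900, 64 versus the marketed 320
streaming processors) and says nothing about the work done per clock. The
variable names are illustrative.

```python
CPU_CLOCK_GHZ = 2.4    # core2 quadcore clock from the text
GPU_CLOCK_GHZ = 0.742  # ATI 2900 clock from the text

# Factor the card must win back elsewhere just to match clock for clock.
clock_handicap = CPU_CLOCK_GHZ / GPU_CLOCK_GHZ

# Marketed versus actual number of streaming processors.
marketing_factor = 320 / 64

print(f"clock handicap: {clock_handicap:.1f}x")
print(f"marketing factor: {marketing_factor:.0f}x")
```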

Meanwhile, on 22 July the 2.4 GHz quadcore drops to 266 dollars, whereas the
ATI 2900 is currently priced at nearly 400 euro here.

It is very hard to compete when you already must make up a factor 3+ to
start with.
That 4.7 GHz power6 is far more interesting in that sense, yet I know in
advance that I won't get any system time on it,
whereas I CAN buy a videocard for a couple of hundred euro.

So the future will tell whether graphics chips can kick butt for a small
price; I sure hope so.

Thanks,
Vincent

----- Original Message ----- 
From: "Mark Hahn" <hahn at mcmaster.ca>
To: "Beowulf Mailing List" <Beowulf at beowulf.org>
Sent: Thursday, June 21, 2007 4:57 PM
Subject: [Beowulf] any gp-gpu clusters?

> Hi all,
> is anyone messing with GPU-oriented clusters yet?
>
> I'm working on a pilot which I hope will be something like 8x 
> workstations, each with 2x recent-gen gpu cards.
> the goal would be to host cuda/rapidmind/ctm-type gp-gpu development.
>
> part of the motive here is just to create a gpu-friendly infrastructure 
> into which commodity cards can be added and refreshed every 8-12 months. 
> as opposed to "investing" in quadro-level cards which are too expensive 
> to toss when obsoleted.
>
> nvidia's 1U tesla (with two g80 chips) looks potentially attractive,
> though I'm guessing it'll be premium/quadro-priced - not really in keeping 
> with the hyper-moore's-law mantra...
>
> if anyone has experience with clustered gp-gpu stuff, I'm interested in 
> comments on particular tools, experiences, configuration of the host
> machines and networks, etc.  for instance, is it naive to think that 
> gp-gpu is most suited to flops-heavy-IO-light apps, and therefore doesn't
> necessarily need a hefty (IB, 10Geth) network?
>
> thanks, mark hahn.