[Beowulf] GPU's - was Westmere EX

Thu Apr 7 12:26:57 PDT 2011

On Apr 7, 2011, at 6:25 PM, Gus Correa wrote:

> Vincent Diepeveen wrote:
>
>> GPU monster box, which is basically a few videocards inside such a
>> box stacked up a tad, will only add a couple of
>> thousand.
>>
>
> This price may be OK for the videocard-class GPUs,
> but sounds underestimated, at least for Fermi Tesla.

Tesla (448 cores @ 1.15 GHz, 3 GB GDDR5): $2,200
Note there is a 6 GB version; I'm not aware of its price, but it will be $$$$, I bet.
Or AMD 6990 (3072 PEs @ 0.83 GHz, 4 GB GDDR5): 519 euro

VERSUS

8-socket Nehalem-EX, 512 GB DDR3 RAM, basic configuration: $205k.

That is roughly a factor of 100 more than those cards.

A couple of thousand versus a couple of hundred thousand.
I hope that makes my point clear.
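
As a quick check of that factor, here is a minimal sketch using only the
dollar prices quoted above. The names are made up for illustration; the
euro-priced 6990 is left out so no exchange rate has to be assumed.

```python
# Prices as quoted in this post (USD). Names are illustrative only.
NEHALEM_EX_8_SOCKET_USD = 205_000
TESLA_3GB_USD = 2_200


def price_ratio(expensive: float, cheap: float) -> float:
    """Return how many times more the expensive item costs."""
    return expensive / cheap


ratio = price_ratio(NEHALEM_EX_8_SOCKET_USD, TESLA_3GB_USD)
print(f"Nehalem-EX box vs. one Tesla: about {ratio:.0f}x")
```

So "factor 100" is a rounded figure; on these two prices alone it comes
out a bit under that.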

> Last I checked, an NVidia S2050 pizza box with four Fermi Tesla C2050,
> with 448 cores and 3GB RAM per GPU, cost around $10k.
> For the beefed-up version with C2070 (6GB/GPU) it bumps to ~$15k.
> If you care about ECC, that's the price you pay, right?

When Fermi was released it was a great GPU.

Regrettably, as I understand it, they lobotomized the double precision
of the gamers' cards, so those hardly have any double precision
capability; if you go for Nvidia you definitely need a Tesla,
no question about it.

As a company I would buy 6990s though: they're a lot cheaper and
roughly 3x faster than the Nvidia cards (more than 3x for some
workloads, less than 3x for others). Note the card has 2 GPUs and
2 x 2 GB == 4 GB RAM on board, so 2 GB per GPU.

3072 cores @ 0.83 GHz, 50% of them with 32-bit multiplication units,
for AMD, versus 448 cores for Nvidia with 448 execution units for
32-bit multiplication.

That matters especially because multiplication has improved a lot.
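
To turn those unit counts into a rough throughput comparison, here is a
small sketch. It assumes each 32-bit multiplication unit retires one
multiply per clock, which is my simplification and not a measured
figure; the core counts, clocks and the 50% fraction are the ones quoted
above. The function name is hypothetical.

```python
def peak_gmul_per_sec(cores: int, mul_fraction: float, clock_ghz: float) -> float:
    """Peak 32-bit multiplies per second, in billions.

    Assumes one multiply per multiplication unit per clock (an assumption,
    not a vendor figure).
    """
    return cores * mul_fraction * clock_ghz


# Figures quoted in the post.
amd_6990 = peak_gmul_per_sec(cores=3072, mul_fraction=0.5, clock_ghz=0.83)
nvidia_tesla = peak_gmul_per_sec(cores=448, mul_fraction=1.0, clock_ghz=1.15)

print(f"AMD 6990:     {amd_6990:.1f} Gmul/s")
print(f"Nvidia Tesla: {nvidia_tesla:.1f} Gmul/s")
print(f"Ratio:        {amd_6990 / nvidia_tesla:.2f}x")
```

On these simplified assumptions the ratio lands around 2.5x, in the same
ballpark as the "roughly 3x" above; real codes will differ either way.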

Having already written CUDA code some while ago, I now wanted the
cheap gamers' card with big horsepower at home, so I'm toying with a
6970 now and will be able to report what is possible to achieve on
that card with respect to prime numbers and such.

I'm a bit amazed that so few public initiatives write code for the
AMD GPUs.

Note that GDDR5 RAM doesn't have ECC by default, but in the case of
AMD it has a CRC calculation (if I understand it correctly). It's a
bit more primitive than ECC, but it works pretty OK and also shows
you when problems occurred there, so figuring out what goes on is
possible.

Make no mistake: this isn't ECC.
We know some HPC centers have ECC as a hard requirement; in that case
Nvidia is the only alternative.
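
To illustrate the difference between detecting and correcting, here is a
minimal sketch. It uses zlib's CRC-32 purely as a stand-in; it is not
the checksum the memory interface actually uses, and flip_bit is a
made-up helper.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with a single bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


payload = bytes(range(64))           # some data written to memory
stored_crc = zlib.crc32(payload)     # checksum computed at write time

corrupted = flip_bit(payload, 137)   # simulate a single-bit error
error_detected = zlib.crc32(corrupted) != stored_crc

# The mismatch says *that* something went wrong, not *which* bit,
# so the only remedy is to retry or recompute. ECC would repair it.
print("error detected:", error_detected)
```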

In earlier posts, some time ago and some years ago, I already wrote
that governments should adapt more to how hardware develops rather
than demand that hardware follow them.

HPC has too little cash to demand that from industry.

I cannot recommend OpenCL at this moment (for a number of reasons).

AMD CAL and CUDA are somewhat similar. Sure, there are differences,
but the majority of codes can be ported quite well (there are
exceptions) or have easy workarounds.

I would advise any company doing GPGPU to develop both branches of
code at the same time, as that gives the company a lot of extra
choices for really very little extra work, maybe 1 coder,
and it always allows you to run your production code on the fastest
setup.

That said, we can safely expect that in raw performance AMD will keep
the leading edge in the coming years from a crunching viewpoint.
Elsewhere I pointed out why.

Even then I'd never bet on just 1 manufacturer. Go for both,
considering how cheap it is.

For a lot of HPC centers the choice of Nvidia will be an easy one, as
the price of the Fermi cards is peanuts compared to the price of the
rest of the system, and considering their other demands that's what
they'll go for.

That might change once you stick bunches of videocards into nodes.

Please note that the GPU 'streamcores', or PEs, or whatever name you
want to give them, are so bloody fast that your code has to work
within the PEs themselves and hardly use the RAM.

For both Nvidia and AMD, the streamcores are so fast that you
simply don't want to lose time on the RAM when your software runs,
let alone use huge amounts of RAM.
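
To put a rough number on "hardly use the RAM", here is a back-of-the-
envelope sketch. The 144 GB/s memory bandwidth is an assumed figure for
the Tesla that does not come from this post, and one multiply per core
per clock is likewise an assumption; only the core count and clock are
taken from above. The function name is hypothetical.

```python
def ops_per_word_needed(ops_per_sec: float, bytes_per_sec: float,
                        word_bytes: int = 4) -> float:
    """Operations that must be done per word fetched from RAM so that
    the compute units, not the memory bus, are the bottleneck."""
    words_per_sec = bytes_per_sec / word_bytes
    return ops_per_sec / words_per_sec


# 448 cores @ 1.15 GHz, one 32-bit multiply per core per clock (assumed).
tesla_ops_per_sec = 448 * 1.15e9
# ASSUMPTION: device memory bandwidth, not stated in the post.
tesla_bytes_per_sec = 144e9

needed = ops_per_word_needed(tesla_ops_per_sec, tesla_bytes_per_sec)
print(f"about {needed:.1f} multiplies per 32-bit word read from RAM")
```

Under these assumptions a kernel has to do on the order of a dozen or
more operations on every word it pulls from RAM before the cores stop
waiting on memory, which is the reason for keeping the working set
inside the PEs.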

Add to that that Nvidia (I still have to figure this out for AMD) can
stream to and from the GPU's RAM from the CPU in the background,
so if you do really large calculations involving many nodes,
all that shouldn't be an issue in the first place.

So if you really need 3 GB or 6 GB rather than 2 GB of RAM, that
would really amaze me, though I'm sure there are cases where that
happens. However, if we look at what was ordered, it is mostly the
3 GB Teslas, at least from what has been reported; I have no global
statistics on that...

Now all choices are valid there, but even then we are talking about
peanuts compared to the price of a single 8-socket Nehalem-EX box,
which fully configured will be maybe $300k-$400k or something?

Whereas a set of 4x Nvidia will probably be under $15k and 4x AMD
6990 is 2000 euro.

There won't be dual-GPU Nvidia cards any time soon because of the
choices they have historically made for the memory controllers.
See the explanation by Intel fanboy David Kanter in a special article
he wrote at realworldtech.

Please note I'm not judging AMD or Nvidia; I suspect they have made
their choices based upon totally different business models, and we
should be happy we have this rich choice right now between CPUs from
different manufacturers and GPUs from different manufacturers.

Nvidia really seems to aim at supercomputers, offering its Tesla line
without lobotomization and lobotomizing its gamers' cards, whereas
AMD aims at gamers and its gamer cards have full functionality
without lobotomization.

Totally different business models. Both have their advantages and
disadvantages.

From a pure performance viewpoint it's easy to see what's faster though.

Yet right now I realize all too well that too many still hesitate
about offering GPU services in addition to CPU services, in which case
having a GPU, regardless of whether it is Nvidia or AMD, kicks butt
of course from a throughput viewpoint.

To be really honest with you guys, I had expected that by 2011 we
would have a GPU reaching far over 1 Teraflop double precision
hands down. Seeing that Nvidia delivers somewhere around 515 Gflop
and AMD needs 2 GPUs on a single card to get over that Teraflop double
precision (the claim is 1.27 Teraflop double precision),
that really is below my expectations from a few years ago.
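
For reference, both quoted peak numbers can be reproduced from the core
counts and clocks given earlier in this post. The sketch below assumes
a fused multiply-add counts as 2 flops per core per clock, and that
double precision runs at 1/2 of the single precision rate on the Tesla
and 1/4 on the 6990; those rates are my assumptions, not something
stated here. The function name is hypothetical.

```python
def peak_dp_gflops(cores: int, clock_ghz: float, dp_rate: float) -> float:
    """Theoretical peak double precision Gflop/s.

    Assumes 2 flops (one multiply-add) per core per clock in single
    precision, scaled by dp_rate for double precision.
    """
    flops_per_clock = 2
    return cores * clock_ghz * flops_per_clock * dp_rate


tesla_dp = peak_dp_gflops(cores=448, clock_ghz=1.15, dp_rate=1 / 2)
amd_6990_dp = peak_dp_gflops(cores=3072, clock_ghz=0.83, dp_rate=1 / 4)

print(f"Tesla:    {tesla_dp:.0f} Gflop/s double precision")
print(f"AMD 6990: {amd_6990_dp / 1000:.2f} Tflop/s double precision")
```

These are theoretical peaks; sustained numbers on real code are lower.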

Now of course I hope you realize I'm not coding double precision code
at all; I'm writing everything in 32-bit integers for the AMD
card, and the Nvidia equivalent also uses 32-bit integers. The
ideal way to do calculations on those cards, including very big
transforms, is using the 32 x 32 == 64 bits instructions (that's 2
instructions in the case of AMD).
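
For those who haven't used them, this is what a 32 x 32 == 64 bits
multiply looks like when it is split into a low-word and a high-word
instruction. It is a plain Python model of the arithmetic only, not GPU
code; the names mul_lo32, mul_hi32 and mul_wide32 are made up for this
sketch.

```python
MASK32 = 0xFFFFFFFF


def mul_lo32(a: int, b: int) -> int:
    """Low 32 bits of the 64-bit product of two unsigned 32-bit values."""
    return (a * b) & MASK32


def mul_hi32(a: int, b: int) -> int:
    """High 32 bits of the 64-bit product of two unsigned 32-bit values."""
    return ((a * b) >> 32) & MASK32


def mul_wide32(a: int, b: int) -> int:
    """Full 64-bit product, assembled from the two 32-bit halves."""
    return (mul_hi32(a, b) << 32) | mul_lo32(a, b)


a = b = 0xFFFFFFFF
print(hex(mul_hi32(a, b)), hex(mul_lo32(a, b)))
```

Two such halves per product is what "2 instructions" refers to above.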

Regards,
Vincent

>
> Gus Correa