[Beowulf] GPU's - was Westmere EX
Vincent Diepeveen
diep at xs4all.nl
Thu Apr 7 12:26:57 PDT 2011
On Apr 7, 2011, at 6:25 PM, Gus Correa wrote:
> Vincent Diepeveen wrote:
>
>> GPU monster box, which is basically a few videocards inside such a
>> box stacked up a tad, will only add a couple of
>> thousands.
>>
>
> This price may be OK for the videocard-class GPUs,
> but sounds underestimated, at least for Fermi Tesla.
Tesla (448 cores @ 1.15 GHz, 3 GB GDDR5): $2,200.
Note there is a 6 GB version; I don't know its price, but it'll be
$$$$ I bet.
Or an AMD 6990 (3072 PEs @ 0.83 GHz, 4 GB GDDR5): 519 euro.
VERSUS
An 8-socket Nehalem-EX, 512 GB DDR3 RAM, basic configuration: $205k.
That's roughly a factor 100 difference against those cards
($205k / $2,200 ~ 93x).
A couple of thousands versus a couple of hundreds of thousands.
I hope I made my point clear.
> Last I checked, a NVidia S2050 pizza box with four Fermi Tesla C2050,
> with 448 cores and 3GB RAM per GPU, cost around $10k.
> For the beefed-up version with C2070 (6GB/GPU) it bumps to ~$15k.
> If you care about ECC, that's the price you pay, right?
When Fermi was released it was a great GPU.
Regrettably they lobotomized the gamer cards' double precision, as I
understand it, so those hardly have any double-precision capability;
if you go for Nvidia you definitely need a Tesla, no question about
it.
As a company I would buy 6990s though; they're a lot cheaper and
roughly 3x faster than the Nvidia cards (more than 3x in some cases,
less in others; note the card has 2 GPUs and 2 x 2 GB == 4 GB of RAM
on board, so 2 GB per GPU).
That's 3072 cores @ 0.83 GHz for AMD, with 50% of them having 32-bit
multiplication units, versus 448 cores for Nvidia, all 448 execution
units capable of 32-bit multiplication. Counting multipliers only,
that's 1536 x 0.83 GHz ~ 1275 G multiplies/s against
448 x 1.15 GHz ~ 515 G multiplies/s, which is where the roughly 3x
comes from. That matters especially because multiplication throughput
has improved a lot.
Having already written CUDA code a while ago, I wanted a cheap gamer
card with big horsepower at home, so I'm now toying with a 6970 and
will be able to report to you what can be achieved on that card with
respect to prime numbers and such.
I'm a bit amazed that so few public initiatives write code for the
AMD GPUs.
Note that GDDR5 RAM doesn't have ECC by default, but in AMD's case it
has a CRC check (if I understand it correctly). That's a bit more
primitive than ECC: it detects errors rather than correcting them,
but it works pretty well and also shows you when problems occurred
there, so figuring out what goes on is possible.
Make no mistake though: this isn't ECC.
We know some HPC centers have ECC as a hard requirement; Nvidia is
then the only alternative.
In earlier posts, some from a while ago and some from years back, I
already wrote that governments should adapt more to how hardware
develops rather than demand that hardware follow them.
HPC has too little cash to demand that from industry.
OpenCL I cannot recommend at this moment (for a number of reasons).
AMD CAL and CUDA are somewhat similar. Sure, there are differences,
but the majority of code ports quite well (there are exceptions), or
there are easy workarounds.
Any company doing GPGPU I would advise to develop both branches of
code at the same time, as that gives the company a lot of extra
choices for really very little extra work. Maybe one extra coder, and
it always allows you to run your production code on the fastest
setup. A sketch of what I mean is below.
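Here is a minimal sketch of that idea (my own illustration; the names
gpu_backend and run_mul32_* are hypothetical, and the two backend
implementations are assumed to live in their own CUDA and CAL files):

#include <stddef.h>

/* One common entry point, two backend implementations kept in sync.
 * run_mul32_cuda() would live in the CUDA branch, run_mul32_cal() in
 * the CAL branch; both fill out[] with 64-bit products of a[], b[]. */
typedef enum { BACKEND_CUDA, BACKEND_CAL } gpu_backend;

int run_mul32_cuda(const unsigned *a, const unsigned *b,
                   unsigned long long *out, size_t n);
int run_mul32_cal (const unsigned *a, const unsigned *b,
                   unsigned long long *out, size_t n);

/* Dispatch to whichever backend is the fastest setup you have. */
int run_mul32(gpu_backend be, const unsigned *a, const unsigned *b,
              unsigned long long *out, size_t n)
{
    return (be == BACKEND_CUDA) ? run_mul32_cuda(a, b, out, n)
                                : run_mul32_cal (a, b, out, n);
}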
That said, we can safely expect that in raw performance AMD will keep
the leading edge for the coming years, from a crunching viewpoint.
Elsewhere I pointed out why.
Even then I'd never bet on just one manufacturer. Go for both,
considering how cheap the cards are.
For a lot of HPC centers the choice of Nvidia will be an easy one, as
the price of the Fermi cards is peanuts compared to the price of the
rest of the system, and considering their other demands that's what
they'll go for.
That might change once you stick bunches of videocards into each
node.
Please note that the GPU 'streamcores' or PEs, whatever name you want
to give them, are so bloody fast that your code has to work within
the PEs themselves and hardly use the RAM.
For Nvidia as well as AMD, the streamcores are so fast that you
simply don't want to lose time on the RAM while your software runs,
let alone use huge amounts of RAM.
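To illustrate, a minimal CUDA kernel sketch of my own (illustrative
only, not production code): load the working set into on-chip shared
memory once, iterate in registers, and touch the global RAM only at
the start and the end.

__global__ void iterate_on_chip(unsigned *data, int rounds)
{
    __shared__ unsigned tile[256];            /* on-chip scratch      */
    int i = threadIdx.x;
    unsigned v = data[blockIdx.x * 256 + i];  /* one global RAM read  */
    tile[i] = v;
    __syncthreads();
    for (int r = 0; r < rounds; ++r)          /* work stays on chip   */
        v = v * 2654435761u + tile[(i + r) & 255];
    data[blockIdx.x * 256 + i] = v;           /* one global RAM write */
}

(Launch with 256 threads per block.)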
Add to that that Nvidia (I still have to figure this out for AMD) can
stream to and from the GPU's RAM from the CPU in the background, so
if you do really large calculations involving many nodes, all of that
shouldn't be an issue in the first place.
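In CUDA that background streaming looks roughly like this (a sketch
under my assumptions: some_kernel and the launch sizes are
placeholders, and pinned host memory is required for the copy to
truly overlap with compute):

#include <cuda_runtime.h>

__global__ void some_kernel(unsigned *d);     /* placeholder kernel */

void overlap_copy_and_compute(unsigned *d_work, size_t nbytes)
{
    unsigned *h_next, *d_next;
    cudaStream_t copy_s, work_s;

    cudaHostAlloc((void **)&h_next, nbytes,
                  cudaHostAllocDefault);      /* pinned host buffer */
    cudaMalloc((void **)&d_next, nbytes);
    cudaStreamCreate(&copy_s);
    cudaStreamCreate(&work_s);

    /* the copy proceeds in the background while the kernel runs */
    cudaMemcpyAsync(d_next, h_next, nbytes,
                    cudaMemcpyHostToDevice, copy_s);
    some_kernel<<<128, 256, 0, work_s>>>(d_work);

    cudaStreamSynchronize(copy_s);  /* next chunk is now on the GPU */
    cudaStreamSynchronize(work_s);
}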
So if you really need 3 GB or 6 GB rather than 2 GB of RAM, that
would really amaze me, though I'm sure there are cases where it
happens. If we look at what was ordered, however, it's mostly the
3 GB Teslas, at least from what has been reported; I have no global
statistics on that...
Now all choices are valid there, but even then we're talking peanuts
compared to the price of a single 8-socket Nehalem-EX box, which
fully configured will be maybe $300k-$400k or something, whereas a
set of 4x Nvidia will probably be under $15k and 4x AMD 6990 is about
2000 euro.
There won't be a 2-GPU Nvidia card any time soon because of the
choices they historically made for the memory controllers. See the
explanation by Intel fanboy David Kanter in a special article he
wrote at realworldtech.
Please note I'm not judging AMD or Nvidia; they have made their
choices based upon totally different business models, I suspect, and
we should be happy we have this rich choice right now between CPUs
from different manufacturers and GPUs from different manufacturers.
Nvidia really seems to aim at supercomputers, shipping their Tesla
line without lobotomization while lobotomizing their gamer cards,
whereas AMD aims at gamers and their gamer cards keep full
functionality without lobotomization.
Totally different business models. Both have their advantages and
disadvantages.
From a pure performance viewpoint it's easy to see which is faster,
though.
Yet right now I realize all too well that just too many still
hesitate about offering GPU services in addition to CPU services, in
which case having a GPU, regardless of whether it's Nvidia or AMD, of
course kicks butt from a throughput viewpoint.
To be really honest with you guys, I had expected that by 2011 we
would have a GPU reaching far over 1 teraflop double precision hands
down. Seeing that Nvidia delivers somewhere around 515 Gflops and
that AMD needs 2 GPUs on a single card to get over that teraflop of
double precision (the claim is 1.27 teraflops), that really falls
below my expectations from a few years ago.
Now of course I hope you realize I'm not writing double-precision
code at all; I write everything in 32-bit integers for the AMD card,
and the Nvidia equivalent also uses 32-bit integers. The ideal way to
do calculations on those cards, including very big transforms, is
using the 32 x 32 == 64-bit multiply instructions (that's 2
instructions in the case of AMD). In CUDA that looks as sketched
below.
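On the Nvidia side that multiply looks like this in CUDA (__umulhi is
a real CUDA intrinsic giving the high 32 bits of the product; the
2-instruction AMD low/high multiply pair is the equivalent):

__device__ unsigned long long mul32x32(unsigned a, unsigned b)
{
    unsigned lo = a * b;             /* low  32 bits of the product */
    unsigned hi = __umulhi(a, b);    /* high 32 bits of the product */
    return ((unsigned long long)hi << 32) | lo;
}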
Regards,
Vincent
>
> Gus Correa