[Beowulf] Clusters just got more important - AMD's roadmap
Vincent Diepeveen
diep at xs4all.nl
Wed Feb 8 10:41:34 PST 2012
On Feb 8, 2012, at 6:15 PM, Mark Hahn wrote:
>> The APU concept has a few interesting points but certainly also a few
>> major problems (when comparing it to a cpu + stand alone gpu setup):
>>
>> * Memory bandwidth to all those FPUs
>
> well, sorta. my experience with GP-GPU programming today is that your
> first goal is to avoid touching anything offchip anyway (spilling, etc),
> so I'm not sure this is a big problem. obviously, the integrated GPU
> is a small slice of a "real" add-in GPU, so needs proportionately
> less bandwidth.
Most of the code that is really fast on gpgpu simply doesn't leave the
compute units at all.

For outsiders: a compute unit is basically 1 vector core (or SIMD) of
a gpu, with its own registers and its own shared memory. At nvidia that
is 64 KB or so of shared memory plus registers, which is quite a lot; at
AMD it is 32 KB of shared memory plus a big multiple of that in local
registers.

A compute unit holds 64 PE's (processing elements) at the newer
generation AMD's (6000 and 7000 series), or 32 at nvidia. In total
nvidia has 512 PE's and the latest AMD has 2048 PE's, so that is 16
compute units at nvidia and 32 at the latest AMD.
You really don't want to touch the RAM much in gpgpu computing; every
trip to RAM slows you down. From a programming model viewpoint there is
zero difference between AMD and Nvidia gpu's there.

Anything that does more than work inside a single compute unit of 32
or 64 'cores' is not going to scale well.
A good example is trial factoring of Mersenne numbers, which works very
well on Nvidia in CUDA. Basically the candidates get generated on the
cpu's and shipped in batches to the gpu; then all calculations for a
batch of candidates occur within a compute unit.
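To make that split concrete, here is a minimal sketch in Python. It is
not the actual CUDA code and not necessarily how the real program
organises things. It assumes the standard properties of Mersenne
factors: for an odd prime p, any factor q of 2^p - 1 has the form
q = 2*k*p + 1 and q mod 8 is 1 or 7. The function names are made up for
illustration. The generator is the part the cpu's would do; the modular
power test is the part that would run inside a compute unit.

```python
def factor_candidates(p, k_max):
    """CPU side: yield candidates q = 2*k*p + 1 with q % 8 in (1, 7)."""
    for k in range(1, k_max + 1):
        q = 2 * k * p + 1
        if q % 8 in (1, 7):
            yield q


def divides_mersenne(p, q):
    """GPU side in a real setup: q divides 2**p - 1 iff 2**p mod q == 1."""
    return pow(2, p, q) == 1


def trial_factor(p, k_max):
    """Return every surviving candidate that divides 2**p - 1."""
    return [q for q in factor_candidates(p, k_max) if divides_mersenne(p, q)]


print(trial_factor(11, 10))  # 2**11 - 1 = 2047 = 23 * 89
print(trial_factor(23, 10))  # 2**23 - 1 = 47 * 178481
```

Note that the sketch does no sieving of composite candidates; it only
shows where the work is divided.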
The problem you stumble upon there is not so much the bandwidth from
cpu to gpu. It's simply that the CPU's are not fast enough at generating
factor candidates to keep the gpu busy, as the GPU is 200x or so faster
than a CPU core. Remember this is just a single GPU, and a relatively
cheap one.
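A back-of-the-envelope way to see the feeding problem, as a Python
sketch. The rates are made-up illustration numbers, not measurements.
The only figure taken from the text is the rough 200x ratio; that a cpu
core generates candidates about as fast as it could test them is an
assumption for the sake of the example, and the function name is
hypothetical.

```python
def gpu_busy_fraction(cpu_cores, gen_rate_per_core, gpu_test_rate):
    """Fraction of time the gpu has work when only the cpu's feed it."""
    supplied = cpu_cores * gen_rate_per_core
    return min(1.0, supplied / gpu_test_rate)


# Hypothetical units: one cpu core generates 1 unit of candidates per
# second, the gpu consumes 200 units per second (the rough 200x ratio).
for cores in (1, 4, 8):
    print(cores, gpu_busy_fraction(cores, 1.0, 200.0))
```

Under these assumptions even 8 cpu cores keep the gpu busy only a few
percent of the time.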
As for games, one would guess it's easier to scale well for graphics,
yet they do not. Call it clumsy programming, call it badly paid coders,
call it 'not necessary to fix as we'll buy a faster gpu soon'; as a
result you typically see that gpgpu programs that scale well cause the
gpu's to eat a lot more power than any game does.
>
>> * Power (CPUs in servers today max out around 120W, with GPUs at >250W)
>
> sure, though the other way to think of this is that you have 250W
> or so of power overhead hanging off your GPU cards. you can amortize
> the "host overhead" by adding several GPUs, but...
>
> think of it this way: an APU is just a low-mid-end add-in GPU
> with the host integrated onto it ;)
>
> I think the real question is whether someone will produce a minimalist
> APU node. since Llano has on-die PCIE, it seems like you'd need only
> APU, 2-4 dimms and a network chip or two. that's going to add up to
> very little beyond the APU's 65 or 100W TDP... (I figure 150W/node
> including PSU overhead.)