[Beowulf] Clusters just got more important - AMD's roadmap

Wed Feb 8 10:41:34 PST 2012

On Feb 8, 2012, at 6:15 PM, Mark Hahn wrote:

>> The APU concept has a few interesting points but certainly also a few
>> major problems (when comparing it to a cpu + stand alone gpu setup):
>>
>> * Memory bandwidth to all those FPUs
>
> well, sorta.  my experience with GP-GPU programming today is that your
> first goal is to avoid touching anything offchip anyway (spilling, etc),
> so I'm not sure this is a big problem.  obviously, the integrated GPU
> is a small slice of a "real" add-in GPU, so needs proportionately
> less bandwidth.

Most of the code that's really fast on GPGPU simply doesn't leave the
compute units at all.

For outsiders: a compute unit is basically one vector core (or SIMD) of
a GPU, with its own registers and its own shared memory (64 KB or so at
Nvidia plus registers, which is quite a lot, and 32 KB of shared memory
for AMD plus a big multiple of that in local registers).

So that's 64 PEs (processing elements) per compute unit on the newer
generation AMDs (6000 and 7000 series), or 32 at Nvidia.
In total, Nvidia has 512 PEs and the latest AMD has 2048 PEs.
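
As a rough sanity check on those numbers: dividing the total PE count by
the PEs per compute unit gives the number of compute units per chip. A
small Python sketch; the names CHIPS and compute_units are just mine for
illustration, the figures are the ones quoted above.

```python
# PE counts as quoted above: (total PEs on the chip, PEs per compute unit).
CHIPS = {
    "nvidia": (512, 32),
    "amd": (2048, 64),
}

def compute_units(total_pes, pes_per_cu):
    """Number of compute units implied by the PE counts."""
    return total_pes // pes_per_cu

cu_counts = {name: compute_units(*pes) for name, pes in CHIPS.items()}
print(cu_counts)
```

So the work has to be cut into a few dozen independent pieces at the
very least, each of which lives inside one compute unit.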

You really don't want to touch the RAM much in GPGPU computing. RAM
slows you down.
There is zero difference in programming model there between AMD and
Nvidia GPUs.

Anything that does more than just work inside a compute unit of 32
or 64 'cores' is not going to scale well.

A good example is trial factoring for Mersenne numbers, which works
very well on Nvidia in CUDA.

Basically, candidates get generated on the CPUs and shipped in batches
to the GPU; then all calculations for a batch of candidates occur
within a compute unit.
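
To make that split concrete, here is a minimal Python sketch of the idea,
not the actual CUDA code. It relies on the standard facts that any factor
q of 2^p - 1 (p prime) has the form q = 2kp + 1 and is 1 or 7 mod 8, and
that q divides 2^p - 1 exactly when 2^p mod q == 1. The function names and
the batch_size parameter are my own, purely for illustration.

```python
def candidates(p, k_max):
    """'CPU side': yield candidate factors q = 2*k*p + 1 of 2**p - 1.

    Any factor of a Mersenne number with prime exponent p has this form
    and is also 1 or 7 mod 8, so everything else is discarded here.
    A real sieve would also drop q that have small prime factors.
    """
    for k in range(1, k_max + 1):
        q = 2 * k * p + 1
        if q % 8 in (1, 7):
            yield q

def divides_mersenne(p, q):
    """'GPU side': q divides 2**p - 1 exactly when 2**p mod q == 1."""
    return pow(2, p, q) == 1

def trial_factor(p, k_max, batch_size=4):
    """Ship candidates in batches and test each batch independently."""
    found, batch = [], []
    for q in candidates(p, k_max):
        batch.append(q)
        if len(batch) == batch_size:
            found += [c for c in batch if divides_mersenne(p, c)]
            batch = []
    found += [c for c in batch if divides_mersenne(p, c)]
    return found

print(trial_factor(11, 10))  # 2**11 - 1 = 2047 = 23 * 89
```

Each test needs only p and q, so a batch can be worked on without ever
going back to RAM, which is why this kind of job fits a compute unit.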

The problem you stumble upon there is not so much the bandwidth from
CPU to GPU. It's simply that the CPUs are not fast enough to generate
candidates for the GPU, as the GPU is 200x or so faster than a CPU core.

The CPUs just can't feed the GPU, as they're too slow at generating
factor candidates to keep the GPU busy.
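
A back-of-envelope version of that imbalance, as a Python sketch. The
only figure taken from the above is the "200x or so"; gen_ratio is a
made-up parameter (how many candidates one CPU core can generate in the
time it would need to test a single one), and the values tried for it
are illustrative, not measurements.

```python
import math

GPU_SPEEDUP = 200  # "200x or so" faster than one CPU core, from the text above

def cores_to_feed_gpu(gen_ratio, gpu_speedup=GPU_SPEEDUP):
    """CPU cores needed to keep one GPU busy.

    gen_ratio is a made-up illustration parameter: how many candidates one
    CPU core can generate in the time it would need to test a single one.
    """
    return math.ceil(gpu_speedup / gen_ratio)

for ratio in (10, 50):
    print(ratio, cores_to_feed_gpu(ratio))
```

Unless generating a candidate is a couple of hundred times cheaper than
testing one, a single core cannot keep up under these assumptions.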

Remember, this is just a single GPU, and a relatively cheap one.

As for games, one would guess it's easier to scale well for graphics,
yet they do not. Call it clumsy programming, call it badly paid coders,
call it 'not necessary to fix as we'll buy a faster GPU soon'; as a
result you typically see that GPGPU programs that scale well cause the
GPUs to eat a lot more power than any game.

>
>> * Power (CPUs in servers today max out around 120W with GPUs at >250W)
>
> sure, though the other way to think of this is that you have 250W
> or so of power overhead hanging off your GPU cards.  you can amortize
> the "host overhead" by adding several GPUs, but...
>
> think of it this way: an APU is just a low-mid-end add-in GPU
> with the host integrated onto it ;)
>
> I think the real question is whether someone will produce a minimalist
> APU node.  since Llano has on-die PCIE, it seems like you'd need only
> APU, 2-4 dimms and a network chip or two.  that's going to add up to
> very little beyond the APU's 65 or 100W TDP...  (I figure 150/node
> including PSU overhead.)