[Beowulf] Clusters just got more important - AMD's roadmap
diep at xs4all.nl
Wed Feb 8 13:01:08 PST 2012
On Feb 8, 2012, at 8:27 PM, Peter Kjellström wrote:
> On Wednesday, February 08, 2012 06:15:01 PM Mark Hahn wrote:
>>> The APU concept has a few interesting points but certainly also
>>> some major problems (when comparing it to a CPU + stand-alone GPU
>>> setup):
>>> * Memory bandwidth to all those FPUs
>> well, sorta. my experience with GP-GPU programming today is that the
>> first goal is to avoid touching anything offchip anyway (spilling, etc.),
>> so I'm not sure this is a big problem. obviously, the integrated GPU
>> is a small slice of a "real" add-in GPU, so needs proportionately
>> less bandwidth.
> Well yes you want to avoid touching memory on a GPU (just as you do
> on a CPU). But just as you can't completely avoid it on a CPU, nor can
> you on a GPU. On a current socket (CPU) you see maybe 20 GB/s and
> 50 GF, and the flop-wise much
50 GFLOPS on a CPU - first of all, very little software actually gets
50 GFLOPS out of a CPU. It might execute 2 SIMD instructions per cycle,
yet not for typical code: to start with it has just 1 multiplication
unit, so you already lose a factor of 2. So the effective output the
CPU delivers isn't much more than what its bandwidth and caches can feed.
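That factor-of-2 loss can be put into a back-of-envelope peak-FLOPS
calculation; a minimal sketch, with illustrative numbers for a
hypothetical 2012-era quad-core chip (assumptions, not measurements):

```python
# Hypothetical 2012-era quad-core CPU; all figures are assumptions.
cores = 4
clock_ghz = 3.0
simd_lanes = 4  # e.g. 256-bit vectors holding 4 doubles

# Paper peak: one SIMD add plus one SIMD multiply issued per cycle.
peak_gflops = cores * clock_ghz * simd_lanes * 2
print(peak_gflops)  # 96.0

# With only a single multiplication unit usable, a multiply-heavy
# code loses roughly the factor of 2 described above.
mul_bound_gflops = peak_gflops / 2
print(mul_bound_gflops)  # 48.0, close to the ~50 GF figure quoted
```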
Now let's skip multiply-add for the rest of this discussion; AFAIK most
fully optimized codes can't use it, and for GPUs that part of the
discussion is the same. Not so for the output bandwidth: on a GPU, on
the other hand, you do achieve the throughput it can deliver.
It's delivering, multiply-add not counted, 0.5 Tflop per second; that's
4 TB/s of operands, or a factor 20 above its maximum bandwidth to the
RAM.
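The arithmetic behind that factor 20, as a quick check (the operand
size and the ~200 GB/s device bandwidth are assumptions taken from the
figures quoted in this thread):

```python
# Bandwidth needed if every flop streamed one 8-byte operand from RAM.
gpu_tflops = 0.5        # sustained throughput, multiply-add not counted
bytes_per_operand = 8   # double precision
needed_tb_s = gpu_tflops * bytes_per_operand
print(needed_tb_s)      # 4.0 TB/s

ram_tb_s = 0.2          # ~200 GB/s device memory bandwidth (assumed)
print(needed_tb_s / ram_tb_s)  # a factor of ~20
```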
RAM can get prefetched, yet there are no clever caches on the GPU -
some read-only L2 cache, that's about it. Writes to the local shared
memory are also not advised, as its bandwidth is a lot slower than what
the compute units can deliver.
So basically if you read and/or write to the RAM at full speed, you
slow down by a factor of 20 or so - a slowdown a CPU does *not* have,
as the CPU is basically so slow that its RAM can keep up with it.
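This is essentially the roofline argument: attainable throughput is
the smaller of peak compute and bandwidth times arithmetic intensity.
A minimal sketch, with assumed device numbers:

```python
def attainable_gflops(peak_gflops, bw_gb_s, flops_per_byte):
    """Roofline model: performance is capped either by raw compute
    or by memory bandwidth times arithmetic intensity."""
    return min(peak_gflops, bw_gb_s * flops_per_byte)

# Hypothetical GPU: 500 GFLOPS peak, 200 GB/s to RAM (assumed).
# Streaming one 8-byte operand per flop gives 1/8 flop per byte.
print(attainable_gflops(500, 200, 1 / 8))  # 25.0 -> the ~20x slowdown

# A CPU at 50 GFLOPS with 20 GB/s stays compute-bound from an
# intensity of 2.5 flops per byte upward.
print(attainable_gflops(50, 20, 2.5))      # 50.0
```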
> faster GPU is also a lot faster in memory access (>200 GB/s).
> Now I admit I'm not a GPU programmer, but are you saying those
> 200 GB/s aren't needed? My assumption was that the fact that CPU codes
> depend on cache for performance but still need good memory bandwidth
> held true even on GPUs.
> Anyway, my point I guess was mostly that it's a lot easier to sort out
> hundreds of gigs per second to memory on a device with RAM directly
> on the PCB
> than on a server socket.
> Also, if the APU is a "small slice of a real GPU" then I question
> the point
> (not much GPU power per classic core or total system foot-print).
>> I think the real question is whether someone will produce an
>> APU node. since Llano has on-die PCIE, it seems like you'd need only
>> the APU, 2-4 dimms and a network chip or two. that's going to add up
>> to very little beyond the APU's 65 or 100W TDP... (I figure 150W/node
>> including PSU overhead.)
> I think anything beyond early testing is a fair bit into the future.
> For the APU to become interesting I think we need a few (or all of):
> * Memory shared with the CPU in some useable way (did not say the
> * A proper number crunching version (ecc...)
> * A fairly high tdp part on a socket with good memory bw
> * Noticeably better "host to device" bandwidth and even more, latency
> And don't get me wrong, I'm not saying the above is particularly