[Beowulf] difference between accelerators and co-processors

Thu Mar 14 04:22:18 PDT 2013

On Mar 12, 2013, at 5:45 AM, Mark Hahn wrote:

>>> I don't think it is a useful distinction: both are basically
>>> independent computers.  obviously, the programming model of Phi is
>>> dramatically more like a conventional processor than Nvidia.
>>>
>>
>> Mark, that's the marketing talk about Xeon Phi.
>
> I have no idea what this means.  Nvidia's programming model is:
>
> 	- board provides a small number of cores (1-16 SMs)
> 	- each core has a large pool of in-flight instructions to help
> 	tolerate memory latency, bank collisions, etc.
> 	- instrs are SIMD-like: width 32-192, depending on model.
> 	- cores are sparse in FP units, relative to SIMD width.
> 	- each SIMD element is slightly thread-like, in that untaken branches
> 	are implemented as consuming a slot, but are masked ("divergence")
> 	- onboard memory at core and global levels, some software-managed
> 	(registers) and/or cache, as well as globally addressable offchip.
>
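As a rough illustration of the masking described in the list above, here is a toy sketch in Python. It is not GPU code: the function name run_warp, the 4-lane width, and the even/odd branch are all made up for illustration. It only shows that when lanes disagree on a branch, both sides get issued and each one consumes a slot.

```python
# Toy model of SIMD-style branch divergence (an illustration, not real GPU code).
# Every lane steps through each issued side of the branch; a mask decides
# which lanes are allowed to keep the result.

def run_warp(values):
    """Apply `x * 2 if x is even else x + 1` to every lane of one warp."""
    mask = [v % 2 == 0 for v in values]   # per-lane branch condition
    out = list(values)
    slots = 0                             # issue slots consumed

    # "then" side: issued once for the whole warp if any lane needs it
    if any(mask):
        slots += 1
        out = [v * 2 if m else o for v, m, o in zip(values, mask, out)]

    # "else" side: issued once if any lane took the other path
    if not all(mask):
        slots += 1
        out = [v + 1 if not m else o for v, m, o in zip(values, mask, out)]

    return out, slots

uniform_out, uniform_slots = run_warp([2, 4, 6, 8])     # all lanes agree
diverged_out, diverged_slots = run_warp([1, 2, 3, 4])   # lanes disagree
print(uniform_out, uniform_slots)
print(diverged_out, diverged_slots)
```

The diverged warp produces the right answers but pays for two slots instead of one, which is the cost the "divergence" bullet refers to.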
> Phi's model is this:
>
> 	- board provides 60ish cores with coherent cache(s).
> 	- each core has a 512b SIMD FPU, otherwise normal x86_64 registers
> 	- cores are in-order, but can be timesliced to hide latency.
>
>> It's surprisingly the same of course except for the cache coherency;
>> big vector processors.
>
> neither system is a vector processor in the normal/classic sense.
>

They are massive vector processors.

If you look from a distance Xeon Phi is not much different from a GPU.

Xeon Phi has vectors of 8 doubles; the SIMD vector units of the
GPUs are even a tad wider.

Of course the next Xeon Phi, version 2, will have to use vectors of
16 doubles and clock a tad lower
if Intel wants to really increase its processing power.

The 'normal' x86_64 registers are not even worth mentioning.
You aren't going to invest in a Xeon Phi
in order to have it run factors slower than the 16-core Xeon box
it sits in.

You invest in it to run vector codes that use 8 doubles per vector,
so you have to write that 512-bit vector code (AVX2-style, or whatever
you want to call it), use 8 doubles in each instruction, and just keep
on multiplying and multiplying.

One keeps hammering at those multiplication instructions. If they
can't multiply fast, they lose to the competition.
The rest of that chip is really less interesting, as long as it can
keep up with the multiplication :)
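
To put a rough number on "keep on multiplying", here is a back-of-the-envelope sketch in Python. Only the 8 doubles per vector and the "60ish" cores come from this thread. The 1.05 GHz clock, the 0.9 GHz clock for a wider part, and counting a fused multiply-add as 2 flops per lane are my assumptions, and the function name peak_dp_gflops is made up for illustration.

```python
def peak_dp_gflops(cores, doubles_per_vector, clock_ghz, flops_per_lane=2):
    """Peak double-precision Gflops if every core issues one vector op per clock.

    flops_per_lane=2 assumes a fused multiply-add counts as two flops.
    """
    return cores * doubles_per_vector * flops_per_lane * clock_ghz

# Current part: 60 cores and 8 doubles per vector (thread), 1.05 GHz (assumed).
phi_now = peak_dp_gflops(cores=60, doubles_per_vector=8, clock_ghz=1.05)

# Hypothetical wider part: 16 doubles per vector, clocked a tad lower (assumed 0.9 GHz).
phi_wide = peak_dp_gflops(cores=60, doubles_per_vector=16, clock_ghz=0.9)

print(f"8-wide:  {phi_now:.0f} Gflops peak")
print(f"16-wide: {phi_wide:.0f} Gflops peak")
```

Under these assumptions, doubling the vector width more than pays for the lower clock, which is the whole argument for going wider.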

>>> I think HSA is potentially interesting for HPC, too.  I really expect
>>> AMD and/or Intel to ship products this year that have a C/GPU chip
>>> mounted on the same interposer as some high-bandwidth ram.
>>
>> How can an integrated gpu outperform a gpgpu card?
>
> if you want dedicated gpu computation, a gpu card is ideal.
> obviously, integrated GPUs reduce the PCIe latency overhead,
> and/or have an advantage in directly accessing host memory.
>
> I'm merely pointing out that the market has already transitioned to
> integrated gpus - the vote on this is closed.
> the real question is what direction the onboard gpu takes:
> how integrated it becomes with the cpu, and how it will take  
> advantage of upcoming 2.5d-stacked in-package dram.
>
>> Something like, what is it, 25 watts versus 250 watts: which will be
>> faster?
>
> per-watt?  per dollar?  per transaction?
>
> the integrated gpu is, of course, merely a smaller number of the same
> cores as the separate card, so will perform the same, relative to a
> proportional slice of the appropriate-generation add-in-card.
>
> trinity a10-5700 has 384 radeon 69xx cores running at 760 MHz,  
> delivering 584 SP gflops - 65W iirc.  but only 30 GB/s for it and  
> the CPU.
>
> let's compare that to a 6930 card: 1280 cores, 154 GB/s, 1920 Gflops.
> about 1/3 the cores, flops, and something less than 1/5 the bandwidth.
> no doubt the lower bandwidth will hurt some codes, and the lower
> host-gpu latency will help others.  I don't know whether APUs have the
> same SP/DP ratio as comparable add-in cards.
>
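
The ratios quoted above are easy to check. A small sketch in Python, using only the figures from the quoted text; the dictionary names are made up, and the 2 flops per core per clock (one multiply-add) is my assumption for reproducing the 584 Gflops figure.

```python
# Figures as quoted in the thread.
apu = {"cores": 384, "sp_gflops": 584, "bandwidth_gbs": 30}
card = {"cores": 1280, "sp_gflops": 1920, "bandwidth_gbs": 154}

# APU as a fraction of the add-in card, per metric.
ratios = {key: apu[key] / card[key] for key in apu}

# Sanity check on the APU peak: cores * 2 flops/clock (assumed) * 0.760 GHz.
apu_peak_gflops = 384 * 2 * 0.760

for key, value in ratios.items():
    print(f"{key}: {value:.3f}")
print(f"APU peak from core count and clock: {apu_peak_gflops:.1f} Gflops")
```

Cores and flops both come out near 0.30, and bandwidth lands just under 0.20, so "about 1/3" and "something less than 1/5" are fair roundings.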
>> I assume you will not build 10 nodes with 10 cpus with integrated
>> gpus in order to rival a single card.
>
> no, as I said, the premise of my suggestion of in-package ram is  
> that it would permit glueless tiling of these chips.  the number  
> you could tile in a 1U chassis would primarily be a question of  
> power dissipation.
> 32x 40W units would be easy.  perhaps 20 60W units.  since I'm just
> making up numbers here, I'm going to claim that performance will be
> twice that of trinity (a nice round 1 Tflop apiece), or 20 Tflops/RU.
> I speculate that 4x 4Gb in-package gddr5 would deliver 64 GB/s and
> 2 GB per socket - a total capacity of 40 GB/RU at 1280 GB/s.
>
> compare this to a 1U server hosting 2-3 K10 cards = 4.6 Tflops and
> 320 GB/s each.  with 3 cards: 13.8 Tflops, 960 GB/s.  similar power
> dissipation.
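
The per-rack-unit totals in that comparison can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name per_rack_unit is made up, the tiled-chip numbers are the speculative ones from the quoted text, and I am treating the K10 figure as 4.6 Tflops per card (an assumption about the intended unit).

```python
def per_rack_unit(units, tflops_each, gbs_each):
    """Aggregate peak flops and memory bandwidth for `units` devices in 1U."""
    return {"tflops": units * tflops_each, "gbs": units * gbs_each}

# Speculative tiled chips: 20 sockets, 1 Tflop and 64 GB/s each.
gb_per_socket = 4 * 4 / 8              # 4 chips x 4 Gbit, 8 bits per byte
tiled = per_rack_unit(units=20, tflops_each=1.0, gbs_each=64)
tiled_capacity_gb = 20 * gb_per_socket

# 1U server with 3 K10 cards: 4.6 Tflops (assumed unit) and 320 GB/s each.
k10 = per_rack_unit(units=3, tflops_each=4.6, gbs_each=320)

print("tiled:", tiled, "capacity GB:", tiled_capacity_gb)
print("k10:  ", k10)
```

On these made-up numbers the tiled box comes out ahead on both flops and bandwidth per rack unit, at 2 GB of memory per socket.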