[Beowulf] difference between accelerators and co-processors

Mark Hahn hahn at mcmaster.ca
Mon Mar 11 21:45:40 PDT 2013

>> I don't think it is a useful distinction: both are basically
>> independent computers.  obviously, the programming model of Phi is
>> dramatically more like a conventional processor than Nvidia's.
> Mark, that's the marketing talk about Xeon Phi.

I have no idea what this means.  Nvidia's programming model is:

 	- board provides a small number of cores (1-16 SMs)
 	- each core has a large pool of in-flight instructions to help
 	tolerate memory latency, bank collisions, etc.
 	- instrs are SIMD-like: width 32-192, depending on model.
 	- cores are sparse in FP units, relative to SIMD width.
 	- each SIMD element is slightly thread-like, in that untaken branches
 	still consume an issue slot, with inactive lanes masked off ("divergence")
 	- onboard memory at core and global levels, some software-managed
 	(registers) and/or cache, as well as globally addressable offchip.
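the divergence point above can be sketched in a few lines.  this is a toy
model of SIMT masked execution, not Nvidia's actual hardware behavior: a
32-wide "warp" executes both sides of a branch, and lanes on the untaken
side are masked rather than skipped, so both paths consume full issue slots.

```python
WARP_WIDTH = 32

def run_warp(data):
    """Execute `x = x*2 if x is odd else x+1` across one warp, SIMT-style."""
    issued = 0  # instruction slots consumed, counting masked-off lanes

    # taken path: only lanes where the predicate holds are active
    mask = [x % 2 == 1 for x in data]
    data = [x * 2 if m else x for x, m in zip(data, mask)]
    issued += WARP_WIDTH  # all 32 slots consumed even if few lanes are active

    # untaken path: the complementary mask.  the warp still spends a
    # full issue slot here -- this is the cost of divergence.
    data = [x + 1 if not m else x for x, m in zip(data, mask)]
    issued += WARP_WIDTH

    return data, issued
```

running this on a warp of 32 values issues 64 slots' worth of instructions
for 32 lanes of useful work - each lane sat idle through one of the two paths.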

Phi's model is this:

 	- board provides 60ish cores with coherent cache(s).
 	- each core has a 512b SIMD FPU, otherwise normal x86_64 registers
 	- cores are in-order, but can be timesliced to hide latency.
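the Phi model above implies a peak-flops envelope.  the core count is from
the post; the clock (~1.05 GHz) and fused multiply-add issue are my
assumptions, not Hahn's numbers:

```python
cores = 60
simd_bits = 512
sp_lanes = simd_bits // 32   # 16 single-precision lanes per 512b SIMD unit
dp_lanes = simd_bits // 64   # 8 double-precision lanes
clock_ghz = 1.05             # assumed, roughly Knights Corner territory
flops_per_lane = 2           # assumed FMA: one multiply + one add per cycle

sp_peak = cores * sp_lanes * flops_per_lane * clock_ghz  # ~2000 SP GFlops
dp_peak = cores * dp_lanes * flops_per_lane * clock_ghz  # half that in DP
```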

> It's surprisingly the same of course except for the cache coherency;
> big vector processors.

neither system is a vector processor in the normal/classic sense.

>> I think HSA is potentially interesting for HPC, too.
>>   I really expect
>> AMD and/or Intel to ship products this year that have a C/GPU chip
>> mounted on
>> the same interposer as some high-bandwidth ram.
> How can an integrated gpu outperform a gpgpu card?

if you want dedicated gpu computation, a gpu card is ideal.
obviously, integrated GPUs reduce the PCIe latency overhead,
and/or have an advantage in directly accessing host memory.

I'm merely pointing out that the market has already transitioned 
to integrated gpus - the vote on this is closed.
the real question is what direction the onboard gpu takes:
how integrated it becomes with the cpu, and how it will take 
advantage of upcoming 2.5d-stacked in-package dram.

> Something like what is it 25 watt versus 250 watt, what will be faster?

per-watt?  per dollar?  per transaction?

the integrated gpu is, of course, merely a smaller number of the same 
cores as the separate card, so it will perform the same as a proportional 
slice of the appropriate-generation add-in card.

trinity a10-5700 has 384 radeon 69xx cores running at 760 MHz, 
delivering 584 SP gflops - 65W iirc.  but only 30 GB/s for it and the CPU.

let's compare that to a 6930 card: 1280 cores, 154 GB/s, 1920 Gflops.
about 1/3 the cores and flops, and something less than 1/5 the bandwidth.
no doubt the lower bandwidth will hurt some codes, and the lower host-gpu
latency will help others.  I don't know whether APUs have the same 
SP/DP ratio as comparable add-in cards.
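the Trinity and 6930 peak figures above follow from the usual
cores x clock x 2 (multiply-add per clock) convention; the 6930 clock
(~750 MHz) is inferred to match the quoted 1920 Gflops:

```python
def sp_gflops(cores, mhz, flops_per_clock=2):
    # peak single-precision GFlops: cores * clock * ops-per-clock
    return cores * mhz * flops_per_clock / 1000.0

trinity = sp_gflops(384, 760)    # ~584 GFlops, matching the post
hd6930  = sp_gflops(1280, 750)   # 1920 GFlops; clock inferred, not quoted

core_ratio = 384 / 1280          # ~0.30, i.e. about 1/3
bw_ratio   = 30 / 154            # ~0.19, something less than 1/5
```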

> I assume you will not build 10 nodes with 10 cpu's with integrated
> gpu in order to rival a
> single card.

no, as I said, the premise of my suggestion of in-package ram is that 
it would permit glueless tiling of these chips.  the number you could 
tile in a 1U chassis would primarily be a question of power dissipation.
32x 40W units would be easy.  perhaps 20 60W units.  since I'm just 
making up numbers here, I'm going to claim that performance will be 
twice that of trinity (a nice round 1 Tflop apiece, or 20 Tflops/RU).
I speculate that 4x 4Gb in-package gddr5 would deliver 64 GB/s, 
2GB/socket - a total capacity of 40 GB/RU at 1280 GB/s.
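to make the made-up rack arithmetic explicit (these are the post's
hypothetical numbers, not shipping hardware):

```python
sockets = 20                 # 60W units per 1U chassis
tflops_each = 1.0            # claimed: twice trinity, a nice round number
bw_each_gbs = 64             # speculated in-package gddr5 bandwidth
chips, gbit_per_chip = 4, 4  # 4x 4Gb in-package gddr5

mem_each_gb = chips * gbit_per_chip / 8   # 16 Gbit = 2 GB per socket

rack_tflops = sockets * tflops_each       # 20 Tflops/RU
rack_bw_gbs = sockets * bw_each_gbs       # 1280 GB/s aggregate
rack_mem_gb = sockets * mem_each_gb       # 40 GB/RU
```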

compare this to a 1U server hosting 2-3 K10 cards at 4.6 Tflops 
and 320 GB/s each: 3 cards gives 13.8 Tflops, 960 GB/s.  similar 
power dissipation.
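and the K10 side of the comparison, using Nvidia's quoted per-card figures
(~4.6 Tflops SP and 320 GB/s aggregate across the card's two GPUs):

```python
cards = 3            # the fuller end of the 2-3 cards per 1U
tflops_card = 4.6    # quoted single-precision peak per K10
bw_card_gbs = 320    # 2 GPUs x 160 GB/s per card

total_tflops = cards * tflops_card   # ~13.8 Tflops per 1U
total_bw_gbs = cards * bw_card_gbs   # 960 GB/s per 1U
```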
