[Beowulf] Xeon Phi questions - does it have *any* future?

Mark Hahn hahn at mcmaster.ca
Sat Dec 15 12:58:22 PST 2012

> The big question i would like to ask intel architects is whether the
> Xeon Phi architecture has a future,

it's hard for me to imagine how this is not a silly question.

> so what comes AFTER this Xeon Phi?

more cores.  lower power-per-core.  higher clocks.  more memory bandwidth.
more per-core cache and/or flops.

this is a place where stacking dram would be a significant win, though
perhaps it's hard to manage given that modern chips are all normally

> From what i understand it has cache coherency - otherwise it could
> not run x86 codes at a slow speed.

it runs x86 because that's such a familiar ISA, and so many tools/codes
can run without significant change.  (not that ARM would be difficult
to imagine, but obviously different from NVidia's ISA.)  cache coherency
does not make a core run slowly - indeed, there's no reason to believe 
that cache coherency is not eminently scalable (using directories, of
course).  what's not scalable is programming techniques that thrash the 
coherency mechanism.
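
to make the "thrashing" point concrete, here is a toy sketch (not a model of Phi's actual protocol). the `Directory` class and `run` helper are made-up names for illustration, and the only cost counted is one invalidation message per remote copy that a write has to kill.

```python
class Directory:
    """Toy directory: for each cache line, the set of cores holding a copy."""

    def __init__(self):
        self.holders = {}        # line id -> set of core ids with a copy
        self.invalidations = 0   # messages sent to kill remote copies

    def write(self, core, line):
        current = self.holders.get(line, set())
        # every other holder of the line must be invalidated before the write
        self.invalidations += len(current - {core})
        self.holders[line] = {core}


def run(num_cores, rounds, shared):
    """Each core writes once per round, to a private line or one shared line."""
    directory = Directory()
    for _ in range(rounds):
        for core in range(num_cores):
            line = 0 if shared else core
            directory.write(core, line)
    return directory.invalidations


if __name__ == "__main__":
    print("private lines:", run(4, 1000, shared=False))
    print("one shared line:", run(4, 1000, shared=True))
```

in this toy, writes to private lines never generate traffic, while cores taking turns on one line pay an invalidation on nearly every write. the cost comes from the access pattern, not from having a directory.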

cache coherency is really best thought of as a modest extension of the 
ways/tag-match architecture of a normal cache.  some of the line state
transitions involve interaction with other cores' caches, but it isn't 
inherently expensive in either space, time or power.  Intel has talked
a bit about the ring bus and how its design was optimized for coherency
traffic (dual-direction rings, each a cacheline wide, with control/coherency
lanes replicated 2x for each; one clock per hop).  in spite of being
quite wide, the ring doesn't seem to dominate the die photos.  onchip
point-to-point links are not, afaik, difficult, especially short ones.

> The gpgpu hardware doesn't have cache coherency.

well, a Cuda programmer _does_ have to worry about various forms of 
data synchronization.  in fact, that's really the main reason why the 
Cuda programming model is so strange to port to.

> This is why we have
> so many cores in such a short period of time at the
> gpu's.


let's compare Phi to the Fermi generation.  Phi has 60 macro-cores,
each 4-threaded and 512b wide.  Fermi has 16 SMs, each with 32 pseudo-cores,
but this is really just SIMD.  from a DP perspective, it's really only
16x wide SIMD.  that is, 512b...

there are some other differences: Phi has conventional-appearing registers
(32x512b, probably per-thread - 4x) versus Fermi's 32kx32b shared among 
all threads.  Phi has conventional caches (32k L1, 512K coherent L2); 
Fermi has 64K L1 storage per SM as well, but can turn off tag-related
behavior to form per-SM "shared" memory (and 768K shared L2).
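
the arithmetic behind those figures, as a small sketch. all constants are the numbers quoted above; the function names are made up here, and the Phi register total leans on the "probably per-thread" guess, so treat it as an assumption rather than a spec.

```python
DP_BITS = 64  # width of one double-precision value

# figures as quoted in the post
PHI_CORES, PHI_THREADS, PHI_REGS, PHI_VECTOR_BITS = 60, 4, 32, 512
FERMI_SMS, FERMI_DP_LANES_PER_SM = 16, 16
FERMI_REGS, FERMI_REG_BITS = 32 * 1024, 32


def phi_figures():
    """Return (chip-wide DP lanes, register bytes per core) for Phi."""
    dp_lanes = PHI_CORES * (PHI_VECTOR_BITS // DP_BITS)
    # assumption: each of the 4 hardware threads has its own 32 x 512b file
    reg_bytes_per_core = PHI_THREADS * PHI_REGS * PHI_VECTOR_BITS // 8
    return dp_lanes, reg_bytes_per_core


def fermi_figures():
    """Return (chip-wide DP lanes, register bytes per SM) for Fermi."""
    dp_lanes = FERMI_SMS * FERMI_DP_LANES_PER_SM
    reg_bytes_per_sm = FERMI_REGS * FERMI_REG_BITS // 8
    return dp_lanes, reg_bytes_per_sm


if __name__ == "__main__":
    print("Phi   (DP lanes, reg bytes/core):", phi_figures())
    print("Fermi (DP lanes, reg bytes/SM):  ", fermi_figures())
```

under these assumptions Phi has more DP lanes chip-wide but a far smaller register file per core, which fits the picture of few heavyweight threads versus many lightweight ones.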

I see no reason to think that Phi's performance will be hurt by the 
directory-based L2 coherency implemented on its ring bus.  Intel's 
clearly been working on low-latency rings for a long time - they're 
forgiving, logically simple and can be quite high-performing.

really, the big difference between Cuda and Phi is the vector programming
model.  Cuda presents a vector architecture as threads, even though you 
can end up with a block containing 31 annulled threads that still consumes
cycles (Nvidia talks about the annulling as "divergence".)  Intel provides
a more conventional masked 512b SIMD - sort of *explicitly* annulling.
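
a rough sketch of what masked SIMD means for a branch, in plain Python rather than real vector intrinsics. `masked_select` is a hypothetical helper: it makes one full-width pass per branch side that any lane needs, and lanes whose mask bit is off keep their old value.

```python
def masked_select(xs, pred, then_fn, else_fn):
    """Apply then_fn or else_fn per lane; return (result, full-width passes)."""
    mask = [bool(pred(x)) for x in xs]
    out = list(xs)
    passes = 0
    if any(mask):
        # "then" pass: every lane occupies a slot, masked-off lanes do nothing
        out = [then_fn(x) if m else o for x, m, o in zip(xs, mask, out)]
        passes += 1
    if not all(mask):
        # "else" pass: same again with the inverted mask
        out = [o if m else else_fn(x) for x, m, o in zip(xs, mask, out)]
        passes += 1
    return out, passes


if __name__ == "__main__":
    lanes = list(range(8))
    print(masked_select(lanes, lambda x: x % 2 == 0,
                        lambda x: x * 2, lambda x: -x))
    # a single odd lane out of 32 still forces a second full-width pass
    print(masked_select([1] + [0] * 31, lambda x: x == 1,
                        lambda x: x + 1, lambda x: x)[1])
```

a single divergent lane costs a whole extra pass in which the other lanes sit idle. that is the same waste as a mostly-annulled block, except that here the mask is visible in the code.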

I haven't seen any good technical discussions about latencies, which is 
what will really bite code that is not EP and low-memory.  in a sense,
both systems are ideal for financial MC, and fairly good for dense 
matrix-matrix math.  for other stuff (say, adaptive mesh astro code),
I'd say Phi has an architectural advantage simply because it has more 
independent cores (60 vs 16) and a somewhat more irregularity-forgiving
memory architecture (distributed coherent L2).

> From answers from engineers i understand the reason why most normal
> cpu's do not have more cores is because of the cache coherency. More

nonsense.  conventional CPUs do not have more cores because most people 
want few-fast cores.  look at AMD's bulldozer/piledriver effort: they 
provided a lot more cores and all the reviews pissed on them.

> cores are a cache coherency nightmare.


> Cache coherency is very costly to maintain in a cpu. So the question


> i want to ask is whether Xeon Phi scales for future
> generations of releases of it.

I think programming models are extremely important: to make the hardware
easier to take advantage of.  Phi seems pretty easy for a single board,
but it's still a bit sticky because it's a PCIE coprocessor (without 
coherent access to host memory, and without any particularly nice way to
connect multiple chips, let alone multiple systems.)  I think a big question
is what level of integration is going to actually drive this market.
are these chips designed for exascale clusters?  if so, scalable interconnect
is probably the main concern at this point.  if there's a significant 
volume market for few-teraflop systems (personal supercomputers), then 
just putting a few cards on a PCIE bus would work OK.

figuring out how to use stacked memory to provide really big bandwidth
has to be in every architect's mind right now.

I also wonder whether AMD has anyone working on this stuff.  it would be 
fascinating if they took a different approach - say a 20W APU with stacked
dram that could be tiled by the score onto a single board.  in some sense,
that's probably a more manufacturable, power-efficient and scalable approach
than the 300W add-in cards that Nvidia and Intel are pursuing.

> Are they going to modify the AVX2 code to AVX3, so vectors from 1024
> bits in the future, in order to get some extra performance?

I doubt it.  no code is purely vectorizable, and lots of code is merely
scalar.  in a meaningful sense, Nvidia's GPUs are the spiritual descendant
of the Tera/Cray MTA architecture, where programmers almost see an ocean
of threads representing the dataflow graph of their program.  the hardware
schedules threads onto cores whenever arguments are available (ie, usually
at the end of a memory access bubble.)  the current Phi is dealing with 
the same basic workload characteristics, but the real question is: at 
what granularity are threads independently scheduled.

on Nvidia, threads in a block are never broken up, in spite of the fact 
that they may have diverged (logically or due to non-coalesced/contiguous
memory references).  Phi "blocks" are, in some sense, 64-wide, and a single
core only has 4 of them in-flight - Phi's thread scheduler is probably 
quite a bit simpler than Nvidia's.  (MTA is, afaict, the extreme case, where
all threads are independent and not run in fixed batches (blocks).)
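
a toy model of why the number of threads in flight matters. it assumes, unrealistically, that every issued instruction is followed by a fixed memory stall and that the core issues at most one instruction per cycle from any ready thread. `utilization` and its parameters are invented for this sketch and do not describe either vendor's scheduler.

```python
def utilization(num_threads, mem_latency, cycles=10_000):
    """Fraction of cycles in which the core issues an instruction."""
    ready_at = [0] * num_threads   # first cycle each thread may issue again
    issued = 0
    for now in range(cycles):
        for t in range(num_threads):
            if ready_at[t] <= now:
                issued += 1
                # the thread stalls for mem_latency cycles after issuing
                ready_at[t] = now + 1 + mem_latency
                break              # only one issue slot per cycle
    return issued / cycles


if __name__ == "__main__":
    for threads, latency in [(1, 3), (4, 3), (4, 7)]:
        print(threads, "threads, stall", latency, "->",
              utilization(threads, latency))
```

in this model 4 threads hide a 3-cycle stall completely, but a longer stall leaves the core idle. a design with only 4 threads per core therefore needs short effective latencies (caches, prefetch), while an ocean-of-threads design can simply tolerate long ones.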

> I assume more cores is going to be near impossible to keep coherent.
> 60 is already a lot.

nonsense.  coherency costs nothing in the absence of sharing/conflict.
even when there is sharing, they could scale using an onchip 2d mesh.
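
a quick sketch of why a 2d mesh is the natural next step after a ring. it brute-forces the average hop count between distinct nodes on a bidirectional ring and on a square mesh. the function names are made up, and hop count is only a rough proxy for latency (it ignores contention and link width).

```python
from itertools import product


def ring_avg_hops(n):
    """Average shortest-path hops between distinct nodes on a bidirectional ring."""
    total = sum(min(abs(i - j), n - abs(i - j))
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


def mesh_avg_hops(k):
    """Average Manhattan-distance hops between distinct nodes on a k x k mesh."""
    nodes = list(product(range(k), repeat=2))
    total = sum(abs(ax - bx) + abs(ay - by)
                for (ax, ay) in nodes for (bx, by) in nodes
                if (ax, ay) != (bx, by))
    return total / (len(nodes) * (len(nodes) - 1))


if __name__ == "__main__":
    for k in (4, 8, 16):
        n = k * k
        print(n, "nodes: ring", round(ring_avg_hops(n), 2),
              "mesh", round(mesh_avg_hops(k), 2))
```

ring distance grows roughly linearly with node count, mesh distance roughly with its square root, so the gap widens as cores are added.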

regards, mark hahn.
not a chip architect, but hey, make an offer ;)
