[Beowulf] Intel kills Knights Hill, Xeon Phi line "being revised"

Stu Midgley sdm900 at gmail.com
Sun Nov 19 17:22:53 PST 2017

We have found that running in cache/quadrant mode gives excellent
performance.  With our codes, the optimum is 2 threads per core.  KNL broke
the model of KNC, which did a full context change every clock cycle (so you
HAD to have multiple threads per core), and that has had the roll-on effect
of reducing the number of threads required to get maximum performance.
However, if your code scales to 128 threads... it probably scales to 240.
So it probably doesn't matter.

The programming model is much easier than that of GPUs.  We have codes
running (extremely fast) on KNL that no one has managed to get running on
GPUs (mostly due to the memory model of the GPU).

So you shouldn't write them off.  No matter which way you turn, you will
most likely have x86 and lots of cores... and those cores will have AVX512
going forward (and probably later AVX1024 or whatever they'll call it).
So, make sure your code vectorises and has no thread-blocking points.
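
As a rough illustration of "vectorises and has no thread-blocking points",
here is a minimal C sketch.  The saxpy and dot kernels are made up for this
example and are not taken from any code discussed in this thread; the sketch
assumes a C99 compiler with OpenMP 4.0 or later.  The restrict qualifiers
promise the compiler that the arrays do not overlap, and the reduction
clause gives each thread a private partial sum instead of a lock or critical
section around a shared one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: y[i] = a * x[i] + y[i].
 * Unit-stride access with no loop-carried dependence, so the compiler
 * is free to vectorise it and split iterations across threads. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    /* restrict is a promise, not a check: catch obvious misuse in debug builds */
    assert(n == 0 || x != y);

    #pragma omp parallel for simd schedule(static)
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Hypothetical kernel: dot product of x and y.
 * reduction(+:sum) keeps a private partial sum per thread and per SIMD
 * lane and combines them once at the end, so no thread waits on a lock
 * inside the loop. */
double dot(size_t n, const double *restrict x, const double *restrict y)
{
    double sum = 0.0;

    #pragma omp parallel for simd reduction(+:sum) schedule(static)
    for (size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];

    return sum;
}
```

With OpenMP enabled (for example -fopenmp with GCC or -qopenmp with the
Intel compiler) the loops are both threaded and vectorised; without it the
pragmas are ignored and the same code runs serially.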

On Mon, Nov 20, 2017 at 9:09 AM, Richard Walsh <rbwcnslt at gmail.com> wrote:

> Well ...
> KNL is (only?) superior for highly vectorizable codes that at scale can
> run out of MCDRAM (slow scalar performance). Multiple memory and
> interconnect modes (requiring a reboot to change) create a programming
> complexity (e.g. managing affinity across 8-9-9-8 tiles in quad mode) that
> few outside the National Labs were able to or interested in managing.
> Using 4 hyper threads is not often useful. When used in cache mode, the
> direct-mapped L3 cache suffers gradual performance degradation from
> fragmentation.  Delays in its release and in the tuning of the KNL BIOS
> for performance shrank its window of advantage over the Xeon line
> significantly, as well as over the then-new GPUs (Pascal).  Meeting the
> performance programming challenges shrank it further (lots of dungeon
> sessions). FLOPS per Watt is good but not as good as GPU.
> Programming environment compatibility is good, although there are those
> instruction subsets that are not portable ... got to build with
> -xCOMMON-AVX512 ...
> But as someone said “it is fast” ... I would say maybe now it “was fast”
> for a comparatively short period of time.  If you already have 10s of racks
> and have them figured out then you like the reduced operating cost and may
> just buy some more as the price drops, but if you did not buy in gen 1 then
> maybe you are not so disappointed at the change of plans ... and maybe it
> is time to merge many-core and multi-core anyway.
> Richard Walsh
> Thrashing River Computing
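
The portability point in the quoted message can be checked at compile time.
What follows is a minimal sketch, assuming the compiler predefines the usual
__AVX512F__-style feature macros (GCC, Clang and the Intel compiler do).
The HAS_* flags and both function names are hypothetical, introduced only
for this example.  ER and PF are the subsets found on KNL but not on Xeon,
while BW, DQ and VL are found on Skylake-generation Xeon but not on KNL; a
-xCOMMON-AVX512 build is meant to stay within F and CD, which both support.

```c
#include <assert.h>

/* Hypothetical bit flags, one per AVX-512 subset of interest. */
enum {
    HAS_F  = 1 << 0,  /* Foundation */
    HAS_CD = 1 << 1,  /* Conflict Detection */
    HAS_ER = 1 << 2,  /* Exponential and Reciprocal (KNL) */
    HAS_PF = 1 << 3,  /* Prefetch (KNL) */
    HAS_BW = 1 << 4,  /* Byte and Word (Xeon) */
    HAS_DQ = 1 << 5,  /* Doubleword and Quadword (Xeon) */
    HAS_VL = 1 << 6,  /* Vector Length extensions (Xeon) */
    AVX512_ALL = (1 << 7) - 1
};

/* Report which subsets the compiler was allowed to use for this build,
 * based on the feature macros it predefines. */
unsigned avx512_subsets_compiled(void)
{
    unsigned mask = 0;
#ifdef __AVX512F__
    mask |= HAS_F;
#endif
#ifdef __AVX512CD__
    mask |= HAS_CD;
#endif
#ifdef __AVX512ER__
    mask |= HAS_ER;
#endif
#ifdef __AVX512PF__
    mask |= HAS_PF;
#endif
#ifdef __AVX512BW__
    mask |= HAS_BW;
#endif
#ifdef __AVX512DQ__
    mask |= HAS_DQ;
#endif
#ifdef __AVX512VL__
    mask |= HAS_VL;
#endif
    return mask;
}

/* True if the mask uses nothing beyond F and CD, i.e. the binary should
 * be portable between KNL and AVX-512 Xeon parts. */
int is_common_only(unsigned mask)
{
    assert((mask & ~(unsigned)AVX512_ALL) == 0);
    return (mask & ~(unsigned)(HAS_F | HAS_CD)) == 0;
}
```

Calling is_common_only(avx512_subsets_compiled()) at start-up is one way to
catch a build that was accidentally targeted at a single processor family.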

Dr Stuart Midgley
sdm900 at gmail.com