<div dir="ltr"><div class="gmail_default" style="font-family:monospace,monospace">We have found that running in cached/quadrant mode gives excellent performance. With our codes, the optimal is 2 threads per core. KNL broke the model of KNC, which did a full context switch every clock cycle (so you HAD to run multiple threads per core); the flow-on effect is that fewer threads are required to get maximum performance. However, if your code scales to 128 threads... it probably scales to 240. So it probably doesn't matter.</div><div class="gmail_default" style="font-family:monospace,monospace"><br></div><div class="gmail_default" style="font-family:monospace,monospace">The programming model is much easier than GPUs. We have codes running (extremely fast) on KNL that no one has managed to get running on GPUs (mostly due to the memory model of the GPU).</div><div class="gmail_default" style="font-family:monospace,monospace"><br></div><div class="gmail_default" style="font-family:monospace,monospace">So you shouldn't write them off. No matter which way you turn, you will most likely have x86 and lots of cores... and those cores will have AVX512 going forward (and probably later AVX1024 or whatever they'll call it). So, make sure your code vectorises and has no thread-blocking points.</div><div class="gmail_default" style="font-family:monospace,monospace"><br></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Nov 20, 2017 at 9:09 AM, Richard Walsh <span dir="ltr"><<a href="mailto:rbwcnslt@gmail.com" target="_blank">rbwcnslt@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
Well ...<br>
<br>
KNL is (only?) superior for highly vectorizable codes that at scale can run out of MCDRAM (scalar performance is slow). Multiple memory and interconnect modes (requiring a reboot to change) create a programming complexity (e.g. managing affinity across the 8-9-9-8 tiles in quadrant mode) that few outside the National Labs were able or interested in managing. Using 4 hyperthreads is not often useful. When used in cache mode, the direct-mapped L3 cache suffers gradual performance degradation from fragmentation. Delays in its release and in the tuning of the KNL BIOS for performance shrank its window of advantage over the Xeon line significantly, as well as over the then-new GPUs (Pascal). Meeting its performance programming challenges added to that shrinkage (lots of dungeon sessions). FLOPS per Watt are good, but not as good as a GPU's. Programming environment compatibility is good, although there are instruction subsets that are not portable ... got to build with<br>
<br>
-xCOMMON-AVX512 ...<br>
<br>
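For example, an illustrative compile line (assuming the Intel compiler; the source file name here is hypothetical):

```shell
# Build a common-denominator AVX-512 binary that runs on both KNL and
# Skylake-SP Xeons, avoiding the KNL-only subsets (AVX512ER/AVX512PF):
icc -O3 -xCOMMON-AVX512 -o solver solver.c
```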
But as someone said “it is fast” ... I would say maybe now it “was fast” for a comparatively short period of time. If you already have 10s of racks and have them figured out, then you like the reduced operating cost and may just buy some more as the price drops; but if you did not buy in gen 1, then maybe you are not so disappointed at the change of plans ... and maybe it is time to merge many-core and multi-core anyway.<br>
<br>
Richard Walsh<br>
Thrashing River Computing<br><br>
</blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><font face="monospace, monospace">Dr Stuart Midgley<br><a href="mailto:sdm900@gmail.com" target="_blank">sdm900@gmail.com</a></font></div></div></div>
</div></div>