[Beowulf] Intel kills Knights Hill, Xeon Phi line "being revised"

Nathan Moore ntmoore at gmail.com
Mon Nov 20 12:47:22 PST 2017


> So, make sure your code vectorises and has no thread-blocking points.

I remember hearing this when learning MPI as an undergrad back in the
90's...  It's probably always been true!

On Sun, Nov 19, 2017 at 7:22 PM, Stu Midgley <sdm900 at gmail.com> wrote:

> We have found that running in cached/quadrant mode gives excellent
> performance.  With our codes, the optimum is 2 threads per core.  KNL broke
> the model of KNC, which did a full context change every clock cycle (so you
> HAD to have multiple threads per core); that has had the knock-on effect of
> reducing the number of threads required to get maximum performance.
> However, if your code scales to 128 threads... it probably scales to 240.
> So it probably doesn't matter.
>
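> A minimal sketch of that 2-threads-per-core placement, assuming OpenMP on a
> 64-core KNL part (file and variable names here are only illustrative):
>
>   /* two_per_core.c -- report the thread count actually launched */
>   #include <stdio.h>
>   #include <omp.h>
>
>   int main(void) {
>       #pragma omp parallel
>       {
>           #pragma omp single
>           printf("running with %d threads\n", omp_get_num_threads());
>       }
>       return 0;
>   }
>
>   /* run:  export OMP_NUM_THREADS=128            # 2 x 64 cores
>    *       export OMP_PLACES=cores OMP_PROC_BIND=close
>    *       ./two_per_core                         # 2 threads pinned per core
>    */
>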
> The programming model is much easier than GPUs'.  We have codes running
> (extremely fast) on KNL that no one has managed to get running on GPUs
> (mostly due to the GPU's memory model).
>
> So you shouldn't write them off.  No matter which way you turn, you will
> most likely have x86 and lots of cores... and those cores will have AVX512
> going forward (and probably AVX1024 later, or whatever they'll call it).
> So, make sure your code vectorises and has no thread-blocking points.
>
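> As a concrete sketch of "vectorises with no thread-blocking points" (the
> kernel is hypothetical; the flags shown target KNL with the Intel compiler):
>
>   /* saxpy.c -- a trivially vectorisable kernel, no locks or barriers */
>   #include <stddef.h>
>
>   void saxpy(float a, const float *restrict x, float *restrict y, size_t n) {
>       #pragma omp simd              /* or rely on -O3 auto-vectorisation */
>       for (size_t i = 0; i < n; ++i)
>           y[i] = a * x[i] + y[i];
>   }
>
>   /* build: icc -O3 -qopenmp -xMIC-AVX512 -qopt-report=2 -c saxpy.c */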
>
> On Mon, Nov 20, 2017 at 9:09 AM, Richard Walsh <rbwcnslt at gmail.com> wrote:
>
>>
>> Well ...
>>
>> KNL is (only?) superior for highly vectorizable codes that at scale can
>> run out of MCDRAM (scalar performance is slow).  Multiple memory and
>> interconnect modes (requiring a reboot to change) create a programming
>> complexity (e.g. managing affinity across the 8-9-9-8 tiles in quadrant
>> mode) that few outside the National Labs were able or interested enough to
>> manage.  Using 4 hyper-threads per core is not often useful.  When run in
>> cache mode, the direct-mapped MCDRAM cache suffers gradual performance
>> degradation from fragmentation.  Delays in its release and in the tuning
>> of the KNL BIOS for performance shrank its window of advantage over the
>> Xeon line significantly, as well as over the then-new GPUs (Pascal).
>> Meeting the performance-programming challenges added to that shrinkage
>> (lots of dungeon sessions), and FLOPS per watt are good but not as good as
>> a GPU's.  Programming-environment compatibility is good, although there
>> are instruction subsets that are not portable ... you have got to build
>> with
>>
>> -xCOMMON-AVX512 ...
>>
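>> As a sketch of that build, and of asking for MCDRAM explicitly via the
>> memkind library's hbwmalloc interface (file and function names below are
>> hypothetical; the flag builds one binary for KNL and Skylake-SP alike):
>>
>>   /* field.c -- build: icc -O3 -xCOMMON-AVX512 field.c -lmemkind */
>>   #include <stdlib.h>
>>   #include <hbwmalloc.h>          /* memkind's high-bandwidth allocator */
>>
>>   double *alloc_field(size_t n) {
>>       void *p = NULL;
>>       /* flat mode: request MCDRAM explicitly; in cache mode a plain
>>        * malloc() already goes through the MCDRAM cache, so the
>>        * fallback below is fine */
>>       if (hbw_posix_memalign(&p, 64, n * sizeof(double)) != 0)
>>           p = malloc(n * sizeof(double));
>>       return p;
>>   }
>>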
>> But as someone said, “it is fast” ... I would say maybe now it “was fast”
>> for a comparatively short period of time.  If you already have tens of
>> racks and have them figured out, then you like the reduced operating cost
>> and may just buy some more as the price drops; but if you did not buy in
>> gen 1, then maybe you are not so disappointed at the change of plans ...
>> and maybe it is time to merge many-core and multi-core anyway.
>>
>> Richard Walsh
>> Thrashing River Computing
>>
>>
>
>
> --
> Dr Stuart Midgley
> sdm900 at gmail.com
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
>


-- 
- - - - - - -   - - - - - - -   - - - - - - -
Nathan Moore
Mississippi River and 44th Parallel
- - - - - - -   - - - - - - -   - - - - - - -

