<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <p>Stu, <br>

    </p>

    <p>I welcome hearing other people's perspectives, especially

      when they have more experience in an area than I do, but I feel

      like your experiences below counter my statements as much as they

      support them. For example, this statement of yours:</p>

    <p>

      <blockquote type="cite">You can not change architecture

        (especially memory) and expect your code to just shift across. 

        It might compile, run and do OK... but you won't get great

        performance.  You have to work at it.  What KNC and KNL give you

        is the ability to shift quickly... and that buys you time to do

        the optimisation.</blockquote>

    </p>

    <p>seems to be in agreement with this one I made: <br>

    </p>

    <p>

      <blockquote type="cite">Then when Intel's MIC processors finally

        did come out, guess what? You <br>

        *did* have to rewrite your code to get any meaningful increase

        in <br>

        performance. </blockquote>

    </p>

    <p>And remember, my argument wasn't really with the technical

      issues. I agree 100% with everything you said. My argument was

      with Intel's marketing, which criticized GPUs for requiring you to

      rewrite your code and claimed that would be unnecessary with

      their accelerators (I don't know if they had come up with the terms

      MIC or Xeon Phi at that time). In the end, you still needed to

      "modernize" your code to get the most out of their processors. <br>

    </p>
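
    <p>To make "modernize" concrete, here is a minimal sketch of the kind
      of change involved, assuming a simple loop whose iterations are
      independent. The function name <code>scale_add</code> is
      hypothetical and not taken from any real code discussed here. The
      OpenMP pragma only takes effect when the compiler is invoked with
      its OpenMP option (for example <code>-fopenmp</code> with gcc);
      otherwise it is ignored and the loop runs serially.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: y[i] = a * x[i] + y[i].
 * Each iteration touches only its own element, so the loop is
 * data-parallel. The restrict qualifiers promise the compiler that
 * x and y do not overlap, which helps it vectorize. */
static void scale_add(size_t n, double a,
                      const double *restrict x, double *restrict y)
{
    assert(x != NULL && y != NULL);

    /* Request that iterations be split across threads and that each
     * thread's chunk be vectorized. Without OpenMP enabled at compile
     * time this line has no effect and the result is unchanged. */
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

    <p>Even in this small case the source had to change: the loop has to
      be written so its iterations are independent, and it has to be
      annotated. Real codes with loop-carried dependencies or scattered
      memory access would likely need considerably more restructuring
      than this.</p>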

    <p>I also don't disagree with the "stickiness" of GPUs in the market

      once they became the established approach. However, I did state

      that I had some colleagues who were very supportive of the Xeon

      Phi because you "didn't need to rewrite your code". Those same

      colleagues never ended up investing significantly in the Xeon Phis

      when they did come out, which signifies to me that they may have

      been disappointed with Intel's promises compared to reality. <br>

    </p>

    <pre class="moz-signature" cols="72">Prentice Bisbal

Lead Software Engineer

Princeton Plasma Physics Laboratory

<a class="moz-txt-link-freetext" href="http://www.pppl.gov">http://www.pppl.gov</a></pre>

    <div class="moz-cite-prefix">On 06/19/2018 10:46 PM, Stu Midgley

      wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:CAEM1RsU+LgugFHfBSJJYg53kKYEO+-buPdrf0PqYC7DBUZnahA@mail.gmail.com">

      <div dir="ltr">I think your comments are wrong.

        <div><br>

        </div>

        <div>I think Intel was right on the mark with their "just throw

          a few flags and it'll work".  For a very large number of codes

          this will work.  If you have a simple FD, stencil or use MKL

          then you're away.  For a large number of people that works.</div>

        <div><br>

        </div>

        <div>Our company got up and running in a few days and were

          getting reasonable speedup... enough that we ended up

          purchasing many thousands of KNC's.  We then spent 4 years

          fine tuning and optimising, getting every last drop of

          performance, which broadened the workload we could run on the

          KNC's and sped up all codes that were on it.</div>

        <div><br>

        </div>

        <div>This is no different to standard Xeon/AMD.  We went through

          exactly the same process when we got 4-socket, 16-core AMD

          systems.  They are very NUMA, so require you to change how

          your code works.  Every time a new vectorisation comes out, we

          spend a lot of effort recoding to use it...</div>

        <div><br>

        </div>

        <div>The KNL's are even easier.  They are just x86 systems with

          a large number of cores and massive vector units.  If the

          compiler can localise your working set to the HBM then they

          absolutely fly.  If you get into the intrinsics, they can

          really scream... which is why we now have many thousands of

          thousands of KNL's.</div>

        <div><br>

        </div>

        <div>You can not change architecture (especially memory) and

          expect your code to just shift across.  It might compile, run

          and do OK... but you won't get great performance.  You have to

          work at it.  What KNC and KNL give you is the ability to shift

          quickly... and that buys you time to do the optimisation.</div>

        <div><br>

        </div>

        <div>Where Intel made the mistake was to assume they could shift

          people from GPUs.  People who have spent years writing and

          optimising won't shift easily... cause they have to go through

          that whole process again.  Getting people back to x86 once

          they have shifted is a long long term goal.  </div>

        <div><br>

        </div>

        <div>And, as I said, Phi isn't dead.  Large vectors, large core

          count with high speed memory - that's Phi.  Intel is just

          shifting that back under the standard Xeon name.</div>

        <div><br>

        </div>

        <div><br>

        </div>

      </div>

      <br>

      <div class="gmail_quote">

        <div dir="ltr">On Wed, Jun 20, 2018 at 5:00 AM Prentice Bisbal

          &lt;<a href="mailto:pbisbal@pppl.gov" moz-do-not-send="true">pbisbal@pppl.gov</a>&gt;

          wrote:<br>

        </div>

        <blockquote class="gmail_quote" style="margin:0 0 0

          .8ex;border-left:1px #ccc solid;padding-left:1ex">On

          06/19/2018 03:10 PM, Joe Landman wrote:<br>

          <br>

          ><br>

          ><br>

          > On 6/19/18 2:47 PM, Prentice Bisbal wrote:<br>

          >><br>

          >> On 06/13/2018 10:32 PM, Joe Landman wrote:<br>

          >>><br>

          >>> I'm curious about your next gen plans, given

          Phi's roadmap.<br>

          >>><br>

          >>><br>

          >>> On 6/13/18 9:17 PM, Stu Midgley wrote:<br>

          >>>> low level HPC means... lots of things.  BUT

          we are a huge Xeon Phi <br>

          >>>> shop and need low-level programmers ie.

          avx512, careful <br>

          >>>> cache/memory management (NOT openmp/compiler

          vectorisation etc).<br>

          >>><br>

          >>> I played around with avx512 in my rzf code. <br>

          >>> <a

            href="https://github.com/joelandman/rzf/blob/master/avx2/rzf_avx512.c"

            rel="noreferrer" target="_blank" moz-do-not-send="true">https://github.com/joelandman/rzf/blob/master/avx2/rzf_avx512.c</a>

          .  <br>

          >>> Never really spent a great deal of time on it,

          other than noting <br>

          >>> that using avx512 seemed to downclock the core a

          bit on Skylake.<br>

          >><br>

          >> If you organize your code correctly, and call the

          compiler with the <br>

          >> right optimization flags, shouldn't the compiler

          automatically handle <br>

          >> a good portion of this 'low-level' stuff? <br>

          ><br>

          > I wish it would do it well, but it turns out it doesn't

          do a good <br>

          > job.   You have to pay very careful attention to almost

          all aspects of <br>

          > making it simple for the compiler, and then constraining

          the <br>

          > directions it takes with code gen.<br>

          ><br>

          > I explored this with my RZF stuff.  It turns out that

          with -O3, gcc <br>

          > (5.x and 6.x) would convert a library call for the power

          function into <br>

          > an FP instruction.  But it would use 1/8 - 1/4 of the

          XMM/YMM register <br>

          > width, not automatically unroll loops, or leverage the

          vector nature <br>

          > of the problem.<br>

          ><br>

          > Basically, not much has changed in 20+ years ... you

          annotate your <br>

          > code with pragmas and similar, or use instruction

          primitives and give <br>

          > up on the optimizer/code generator.<br>

          ><br>

          > When it comes down to it, compilers aren't really as

          smart as many of <br>

          > us would like.  Converting idiomatic code into efficient

          assembly <br>

          > isn't what they are designed for.  Rather correct

          assembly.  Correct <br>

          > doesn't mean efficient in many cases, and some of the

          less obvious <br>

          > optimizations that we might think to be beneficial are

          not taken. We <br>

          > can hand modify the code for this, and see if these

          optimizations are <br>

          > beneficial, but the compilers often are not looking at a

          holistic <br>

          > problem.<br>

          ><br>

          >> I understand that hand-coding this stuff usually

          still gives you the <br>

          >> best performance (See GotoBLAS/OpenBLAS, for

          example), but does your <br>

          >> average HPC programmer trying to get decent

          performance need to <br>

          >> hand-code that stuff, too?<br>

          ><br>

          > Generally, yes.  Optimizing serial code for GPUs doesn't

          work well. <br>

          > Rewriting for GPUs (e.g. taking into account the GPU

          data/compute flow <br>

          > architecture) does work well.<br>

          ><br>

          <br>

          Thanks for the reply. This sounds like the perfect opportunity

          for me to <br>

          rant about Intel's marketing for Xeon Phi vs. GPUs. When GPUs

          took off <br>

          and Intel was formulating their answer to GPUs, they kept

          saying you <br>

          wouldn't need to rewrite your code like you need to for GPUs.

          You could <br>

          just recompile and everything would work on the new MIC

          processors.<br>

          <br>

          Then when Intel's MIC processors finally did come out, guess

          what? You <br>

          *did* have to rewrite your code to get any meaningful increase

          in <br>

          performance. For example, you'd have to make sure your loops

          were <br>

          data-parallel and use OpenMP or TBB, or Cilk Plus or whatever,

          to really <br>

          take advantage of the MIC.  This meant you had to rewrite your

          code, but <br>

          Intel did everything they could to avoid admitting you would

          need to <br>

          rewrite your code. Instead, they used the euphemism 'code

          modernization'.<br>

          <br>

          I often wonder if that misleading marketing is one of the

          reasons why <br>

          the Xeon Phi has already been canned. I know a lot of people

          who were <br>

          excited for the Xeon Phi, but I don't know any who ever bought

          the Xeon <br>

          Phis once they came out.<br>

          <br>

          Prentice<br>

          _______________________________________________<br>

          Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org"

            target="_blank" moz-do-not-send="true">Beowulf@beowulf.org</a>

          sponsored by Penguin Computing<br>

          To change your subscription (digest mode or unsubscribe) visit

          <a href="http://www.beowulf.org/mailman/listinfo/beowulf"

            rel="noreferrer" target="_blank" moz-do-not-send="true">http://www.beowulf.org/mailman/listinfo/beowulf</a><br>

        </blockquote>

      </div>

      <br clear="all">

      <div><br>

      </div>

      -- <br>

      <div dir="ltr" class="gmail_signature"

        data-smartmail="gmail_signature">

        <div dir="ltr">Dr Stuart Midgley<br>

          <a href="mailto:sdm900@gmail.com" target="_blank"

            moz-do-not-send="true">sdm900@gmail.com</a></div>

      </div>

    </blockquote>

    <br>

  </body>

</html>