[Beowulf] Working for DUG, new thread

Prentice Bisbal pbisbal at pppl.gov
Wed Jun 20 07:02:55 PDT 2018


Stu,

I'm open to hearing other people's perspectives, especially when they 
have more experience in an area than I do, but I feel like your 
experiences below counter my statements as much as they support them. 
For example, this statement of yours:

> You can not change architecture (especially memory) and expect your 
> code to just shift across. It might compile, run and do OK... but you 
> won't get great performance.  You have to work at it.  What KNC and 
> KNL give you is the ability to shift quickly... and that buys you time 
> to do the optimisation.

Seems to be in agreement with this one I made:

> Then when Intel's MIC processors finally did come out, guess what? You
> *did* have to rewrite your code to get any meaningful increase in
> performance. 

And remember, my argument wasn't really with the technical issues. I 
agree 100% with everything you said. My argument was with Intel's 
marketing, which criticized GPUs for requiring you to rewrite your code 
and claimed that would be unnecessary with their accelerators (I don't 
know if they had coined the terms MIC or Xeon Phi at that point), yet 
you still needed to "modernize" your code to get the most out of their 
processors.
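
To make that concrete for anyone reading along, here is roughly what the 
"modernization" amounted to (a minimal sketch of my own, not anything 
taken from Intel's materials): restructure the hot loop so it is 
data-parallel, then annotate it, e.g. with OpenMP SIMD, so the wide 
vector units actually get used.

/* Minimal sketch: a data-parallel loop annotated for vectorization.
 * Build with something like: icc -qopenmp -xMIC-AVX512 saxpy.c
 * (or gcc -fopenmp -mavx512f saxpy.c). */
#include <stdio.h>

#define N 1048576

static float x[N], y[N];

int main(void)
{
    const float a = 2.0f;

    for (int i = 0; i < N; i++) {   /* fill with something deterministic */
        x[i] = (float)i;
        y[i] = 1.0f;
    }

    /* Independent iterations, unit stride, no aliasing -- the shape the
     * Phi's 512-bit vector units (and the compiler) need to see. */
    #pragma omp simd
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("%f\n", y[N - 1]);
    return 0;
}

Trivial here, but getting real loops into that shape (no aliasing, no 
loop-carried dependencies, unit stride) was exactly the rewriting that 
Intel's marketing said you wouldn't need.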

I also don't disagree with the "stickiness" of GPUs in the market once 
they became the established approach. However, I did state that I had 
some colleagues who were very supportive of the Xeon Phi because you 
"didn't need to rewrite your code". Those same colleagues never ended up 
investing significantly in the Xeon Phis when they did come out, which 
suggests to me that they may have been disappointed by how Intel's 
promises compared with reality.
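
And on your point below about the compiler localising the working set to 
the HBM: for anyone who hasn't touched a KNL, one explicit route is the 
memkind library's hbwmalloc interface. The sketch below is just my own 
illustration of that API, not anything from DUG's code; the 
no-code-change alternatives are MCDRAM cache mode, or flat mode with 
numactl --membind pointed at the MCDRAM node.

/* Sketch only: put a hot buffer in KNL's MCDRAM via memkind's hbwmalloc
 * API, falling back to DDR when no high-bandwidth memory is present.
 * Build with something like: cc hbw_demo.c -lmemkind */
#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>

int main(void)
{
    size_t n = (size_t)1 << 26;                /* 64M doubles, ~512 MB  */
    int on_hbm = (hbw_check_available() == 0); /* 0 means HBM is usable */
    double *buf = on_hbm ? hbw_malloc(n * sizeof *buf)
                         : malloc(n * sizeof *buf);

    if (!buf)
        return 1;

    for (size_t i = 0; i < n; i++)             /* touch it so it's faulted in */
        buf[i] = (double)i;

    printf("on_hbm = %d, last = %f\n", on_hbm, buf[n - 1]);

    if (on_hbm)
        hbw_free(buf);
    else
        free(buf);
    return 0;
}

Whether that buys you anything obviously depends on the working set 
actually fitting in the 16 GB of MCDRAM.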

Prentice Bisbal
Lead Software Engineer
Princeton Plasma Physics Laboratory
http://www.pppl.gov

On 06/19/2018 10:46 PM, Stu Midgley wrote:
> I think your comments are wrong.
>
> I think Intel was right on the mark with their "just throw a few flags 
> and it'll work".  For a very large number of codes this will work.  If 
> you have a simple FD, stencil or use MKL then you're away.  For a large 
> number of people that works.
>
> Our company got up and running in a few days and were getting 
> reasonable speedup... enough that we ended up purchasing many 
> thousands of KNC's.  We then spent 4 years fine tuning, optimising 
> getting every last drop of performance.  Which broadened the workload 
> we could run on the KNC's and sped up all codes that were on it.
>
> This is no different to standard Xeon/AMD.  We went through exactly 
> the same process when we got 4-socket, 16-core AMD systems.  They are 
> very NUMA, so require you to change how your code works.  Every time a 
> new vectorisation comes out, we spend a lot of effort recoding to use 
> it...
>
> The KNL's are even easier.  They are just x86 systems with a large 
> number of cores and massive vector units.  If the compiler can 
> localise your working set to the HBM then they absolutely fly.  If you 
> get into the intrinsics, they can really scream... which is why we now 
> have many thousands of KNL's.
>
> You can not change architecture (especially memory) and expect your 
> code to just shift across.  It might compile, run and do OK... but you 
> won't get great performance.  You have to work at it.  What KNC and 
> KNL give you is the ability to shift quickly... and that buys you time 
> to do the optimisation.
>
> Where Intel made the mistake was to assume they could shift people 
> from GPUs.  People who have spent years writing and optimising won't 
> shift easily... because they have to go through that whole process 
> again.  Getting people back to x86 once they have shifted is a long, 
> long-term goal.
>
> And, as I said, Phi isn't dead.  Large vectors, large core count with 
> high speed memory - that's Phi.  Intel is just shifting that back 
> under the standard Xeon name.
>
>
>
> On Wed, Jun 20, 2018 at 5:00 AM Prentice Bisbal <pbisbal at pppl.gov> wrote:
>
>     On 06/19/2018 03:10 PM, Joe Landman wrote:
>
>     >
>     >
>     > On 6/19/18 2:47 PM, Prentice Bisbal wrote:
>     >>
>     >> On 06/13/2018 10:32 PM, Joe Landman wrote:
>     >>>
>     >>> I'm curious about your next gen plans, given Phi's roadmap.
>     >>>
>     >>>
>     >>> On 6/13/18 9:17 PM, Stu Midgley wrote:
>     >>>> low level HPC means... lots of things.  BUT we are a huge Xeon Phi
>     >>>> shop and need low-level programmers ie. avx512, careful
>     >>>> cache/memory management (NOT openmp/compiler vectorisation etc).
>     >>>
>     >>> I played around with avx512 in my rzf code.
>     >>> https://github.com/joelandman/rzf/blob/master/avx2/rzf_avx512.c .
>     >>> Never really spent a great deal of time on it, other than noting
>     >>> that using avx512 seemed to downclock the core a bit on Skylake.
>     >>
>     >> If you organize your code correctly, and call the compiler with the
>     >> right optimization flags, shouldn't the compiler automatically handle
>     >> a good portion of this 'low-level' stuff?
>     >
>     > I wish it would do it well, but it turns out it doesn't do a good
>     > job.   You have to pay very careful attention to almost all aspects
>     > of making it simple for the compiler, and then constraining the
>     > directions it takes with code gen.
>     >
>     > I explored this with my RZF stuff.  It turns out that with -O3, gcc
>     > (5.x and 6.x) would convert a library call for the power function
>     > into an FP instruction.  But it would use 1/8 - 1/4 of the XMM/YMM
>     > register width, not automatically unroll loops, or leverage the
>     > vector nature of the problem.
>     >
>     > Basically, not much has changed in 20+ years ... you annotate your
>     > code with pragmas and similar, or use instruction primitives and
>     > give up on the optimizer/code generator.
>     >
>     > When it comes down to it, compilers aren't really as smart as many
>     > of us would like.  Converting idiomatic code into efficient assembly
>     > isn't what they are designed for.  Rather, correct assembly.  Correct
>     > doesn't mean efficient in many cases, and some of the less obvious
>     > optimizations that we might think to be beneficial are not taken.  We
>     > can hand modify the code for this, and see if these optimizations
>     > are beneficial, but the compilers often are not looking at a holistic
>     > problem.
>     >
>     >> I understand that hand-coding this stuff usually still gives you the
>     >> best performance (see GotoBLAS/OpenBLAS, for example), but does your
>     >> average HPC programmer trying to get decent performance need to
>     >> hand-code that stuff, too?
>     >
>     > Generally, yes.  Optimizing serial code for GPUs doesn't work well.
>     > Rewriting for GPUs (e.g. taking into account the GPU data/compute
>     > flow architecture) does work well.
>     >
>
>     Thanks for the reply. This sounds like the perfect opportunity for me
>     to rant about Intel's marketing for Xeon Phi vs. GPUs. When GPUs took
>     off and Intel was formulating their answer to GPUs, they kept saying
>     you wouldn't need to rewrite your code like you need to for GPUs. You
>     could just recompile and everything would work on the new MIC
>     processors.
>
>     Then when Intel's MIC processors finally did come out, guess what? You
>     *did* have to rewrite your code to get any meaningful increase in
>     performance. For example, you'd have to make sure your loops were
>     data-parallel and use OpenMP or TBB, or Cilk Plus or whatever, to
>     really take advantage of the MIC.  This meant you had to rewrite your
>     code, but Intel did everything they could to avoid admitting you would
>     need to rewrite your code. Instead, they used the euphemism 'code
>     modernization'.
>
>     I often wonder if that misleading marketing is one of the reasons why
>     the Xeon Phi has already been canned. I know a lot of people who were
>     excited for the Xeon Phi, but I don't know any who ever bought the
>     Xeon Phis once they came out.
>
>     Prentice
>     _______________________________________________
>     Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>     To change your subscription (digest mode or unsubscribe) visit
>     http://www.beowulf.org/mailman/listinfo/beowulf
>
>
>
> -- 
> Dr Stuart Midgley
> sdm900 at gmail.com
