Hi all,

> This is, in my humble opinion, also the big problem CPUs are facing. They
> are built to tackle all possible scenarios, from simple integer to floating
> point, from in-memory to disc I/O. In some respect it would have been better
> to stick with a separate math unit which then could be selected according to
> the workload you want to run on that server. I guess this is where the GPUs
> are trying to fit in here, or maybe ARM.

  I recall a few years ago the rumors that the Argonne "A18" system was
going to use the 'Configurable Spatial Accelerators' that Intel was
developing, with the idea being you *could* reconfigure based on the needs
of the code.  In principle, it sounds like the Holy Grail, but in practice
it seems quite difficult, and I don't believe I've heard much more about
the CSA approach since.

WikiChip on the CSA:
NextPlatform article:

  I have to imagine that research hasn't gone fully quiet, especially with
Intel's moves towards oneAPI and their FPGA experiences, but I haven't seen
anything about it in a while.  Of course....

> I also agree with the compiler "problem". If you are starting to push some
> compilers too much, the code is running very fast but the results are
> simply wrong. Again, in an ideal world we have a compiler for the job for
> the given hardware which also depends on the job you want to run.

 ... It exacerbates the compiler issues, *I think*.  I hesitate to say it
does so definitively, since the patent write-up talks about how the CSA
architecture uses a representation very similar to what the (now old) Intel
compilers created as an IR (intermediate representation).  In my opinion,
having a compiler that can 'do everything' is like having an AI that can do
everything - we're good at very, *very* specific use-cases, but not
generality.  So configurable systems are a big challenge.  (I'm *way* out
of my depth on compilers, though - maybe they're improving massively?)
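
  On the quoted "very fast but wrong" point: one well-known way this happens
is when aggressive optimization is allowed to reorder floating-point
arithmetic (GCC's -ffast-math, for instance, permits reassociation).
Floating-point addition is not associative, so the reordered sum can differ.
A minimal sketch in Python, with values I picked purely for illustration:

```python
# Floating-point addition is not associative: regrouping the same three
# terms changes the rounded result. The values are illustrative only.
big, small = 1e16, 1.0

as_written = (big + small) - big   # small is lost when added to big first
reordered = (big - big) + small    # cancelling big first keeps small

print(as_written, reordered)
```

Neither grouping is "the" right answer to the compiler; a numerically
sensitive code simply gets different bits depending on which one it emits.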

> Maybe the whole climate problem will finally push HPC into the more
> bespoke system where the components are fit for the job in question, say
> weather modeling for example, simply as that would be more energy efficient
> and faster.

  I can't speak to whether climate research will influence hardware, but
back to the *original* theme of this thread, I actually have some data (very
*limited* data, mind you!) on how NCAR's climate model, CESM, run in an
'F2000climo' case (one of many, many cases, and very atmosphere-focused)
at 2-degree atmosphere resolution (*very* coarse) on a 36-core Xeon Skylake,
performs across AVX2, AVX512 and AVX512+FMA.  By default, FMA is turned off
in these cases due to numerical sensitivity.  So, that's a *very* specific
case, but on the off chance people are curious, here's what it looks like.
Note that this is *noisy* data, because the model also does a lot of I/O,
which is why I tend to look at the median times (last row below):

SKX (AWS C5N.18xlarge) Performance Comparison
CESM Case: F2000climo @ f19_g17 resolution
(36 cores each component / 10 model day run, skipping 1st and last)

Flags     AVX2 (no FMA)   AVX512 (no FMA)   AVX512 + FMA
Min           60.18            60.24            59.16
Max           66.26            60.47            59.40
Median        60.28            60.38            59.32
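
  To put those numbers in relative terms, here is a small sketch that takes
the min/max/median values straight from the table and expresses each median
as a percentage change against the AVX2 run, alongside each run's own
min-to-max spread. The helper name `relative_change` is just mine, and the
units are whatever the table reports.

```python
# (min, max, median) per build, copied from the table above.
timings = {
    "AVX2 (no FMA)":   (60.18, 66.26, 60.28),
    "AVX512 (no FMA)": (60.24, 60.47, 60.38),
    "AVX512 + FMA":    (59.16, 59.40, 59.32),
}

def relative_change(value, reference):
    """Percentage change of value relative to reference."""
    return 100.0 * (value - reference) / reference

baseline_median = timings["AVX2 (no FMA)"][2]

# Median of each build versus the AVX2 median.
median_change = {
    flags: relative_change(median, baseline_median)
    for flags, (_, _, median) in timings.items()
}
# Min-to-max spread within each build, as a rough noise indicator.
spread = {
    flags: relative_change(hi, lo)
    for flags, (lo, hi, _) in timings.items()
}

for flags in timings:
    print(f"{flags:16s} median {median_change[flags]:+.2f}%  "
          f"spread {spread[flags]:.2f}%")
```

The medians move by roughly +0.2% (AVX512) and -1.6% (AVX512 + FMA) against
AVX2, while the AVX2 run alone spans about 10% from min to max.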

  The take-away?  We're not really benefiting *at all* (at this resolution,
for this compset, etc) from AVX512 here.  Maybe at higher resolution?
Maybe with more vertical levels, or chemistry, or something like that?
*Maybe*, but differences seem indistinguishable from noise here, and
possibly negative!  Now, give us more *memory bandwidth*, and that's
fantastic.  Could this code be rewritten to take better advantage of larger
vectors?  Sure, and some *really* capable people do work on that sort of
stuff, and it helps, but as an *evolution* in performance, not a revolution
in it.
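
  As for why FMA can matter numerically: a fused multiply-add rounds once,
after the exact a*b + c, whereas the unfused form rounds the product first
and then the sum. A minimal sketch of the effect, emulating the single
rounding with exact rational arithmetic (the `fma_emulated` helper and the
input values are my own illustration, not anything taken from CESM):

```python
from fractions import Fraction

def fma_emulated(a, b, c):
    """Emulate a fused multiply-add: exact a*b + c, rounded once to double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

# Inputs chosen so the exact product (1 - 2**-54) is not representable
# as a double and rounds to 1.0 when the multiply is rounded separately.
a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

unfused = a * b + c             # two roundings: product, then sum
fused = fma_emulated(a, b, c)   # one rounding at the very end

print(unfused, fused)
```

Both answers are "correctly rounded" for what they compute; they just
compute slightly different things, which is enough to change results in a
sensitive model.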

  (Also, I'm always horrified by presenting one-off tests as examples of
anything, but it's the only data I have on-hand!  Other cases may indeed
differ.)

> Before somebody comes along with: but but but it costs! Think about how
> much money is being spent simply to kill people, or on other wasteful
> projects like Brexit etc.

  One can only hope.  When it comes to spending on research, I recall the
saying:
   "If you think education is expensive, try ignorance!"

  - Brian

On Monday, 21 June 2021, 14:46:30 BST, Joe Landman wrote:
> > On 6/21/21 9:20 AM, Jonathan Engwall wrote:
> > > I have followed this thinking "square peg, round hole."
> > > You have got it again, Joe. Compilers are your problem.
> >
> > Erp ... did I mess up again?
> >
> > System architecture has been a problem ... making a processing unit
> > 10-100x as fast as its support components means you have to code with
> > that in mind.  A simple `gfortran -O3 mycode.f` won't necessarily
> > generate optimal code for the system ( but I swear ... -O3 ... it says
> > it on the package!)
> >
> > Way back at Scalable, our secret sauce was largely increasing IO
> > bandwidth and lowering IO latency while coupling computing more tightly
> > to this massive IO/network pipe set, combined with intelligence in the
> > kernel on how to better use the resources.  It was simply a better
> > architecture.  We used the same CPUs.  We simply exploited the design
> > better.
> >
> > End result was codes that ran on our systems with off-cpu work (storage,
> > networking, etc.) could push our systems far harder than competitors.
> > And you didn't have to use a different ISA to get these benefits.  No
> > recompilation needed, though we did show the folks who were interested,
> > how to get even better performance.
> >
> > Architecture matters, as does implementation of that architecture.
> > There are costs to every decision within an architecture.  For AVX512,
> > along comes lots of other baggage associated with downclocking, etc.
> > You have to do a cost-benefit analysis on whether or not it is worth
> > paying for that baggage, with the benefits you get from doing so.  Some
> > folks have made that decision towards AVX512, and have been enjoying the
> > benefits of doing so (e.g. willing to pay the costs).  For the general
> > audience, these costs represent a (significant) hurdle one must overcome.
> >
> > Here's where awesome compiler support would help.  FWIW, gcc isn't that
> > great a compiler.  It's not performance minded for HPC. It's a reasonable
> > general purpose standards compliant (for some subset of standards)
> > compilation system.  LLVM is IMO a better compiler system, and its
> > clang/flang are developing nicely, albeit still not really HPC focused.
> > Then you have variants built on that.  Like the Cray compiler, Nvidia
> > compiler and AMD compiler. These are HPC focused, and actually do quite
> > well with some codes (though the AMD version lags the Cray and Nvidia
> > compilers). You've got the Intel compiler, which would be a good general
> > compiler if it wasn't more of a marketing vehicle for Intel processors
> > and their features (hey you got an AMD chip?  you will take the slowest
> > code path even if you support the features needed for the high
> > performance code path).
> >
> > Maybe, someday, we'll get a great HPC compiler for C/Fortran.
