[Beowulf] AMD and AVX512
bdobbins at gmail.com
Mon Jun 21 18:39:06 UTC 2021
> This is, in my humble opinion, also the big problem CPUs are facing. They
> build to tackle all possible scenarios, from simple integer to floating
> from in-memory to disc I/O. In some respect it would have been better to
> with a separate math unit which then could be selected according to your
> workload you want to run on that server. I guess this is where the GPUs
> trying to fit in here, or maybe ARM.
I recall a few years ago the rumors that the Argonne "A18" system was
going to use the 'Configurable Spatial Accelerators' that Intel was
developing, with the idea being you *could* reconfigure based on the needs
of the code. In principle, it sounds like the Holy Grail, but in practice
it seems quite difficult, and I don't believe I've heard much more about
the CSA approach since.
WikiChip on the CSA:
I have to imagine that research hasn't gone fully quiet, especially with
Intel's moves towards oneAPI and their FPGA experiences, but I haven't seen
anything about it in a while. Of course....
> I also agree with the compiler "problem". If you are starting to push some
> compilers too much, the code is running very fast but the results are
> wrong. Again, in an ideal world we have a compiler for the job for the
> hardware which also depends on the job you want to run.
... It exacerbates the compiler issues, *I think*. I hesitate to say it
does so definitively, since the patent write-up talks about how the CSA
architecture uses a representation very similar to what the (now old) Intel
compilers created as an IR (intermediate representation). In my opinion,
having a compiler that can 'do everything' is like having an AI that can do
everything - we're good at very, *very* specific use-cases, but not
generality. So configurable systems are a big challenge. (I'm *way* out
of my depth on compilers, though - maybe they're improving massively?)
> Maybe the whole climate problem will finally push HPC into the more
> system where the components are fit for the job in question, say weather
> modeling for example, simply as that would be more energy efficient and
I can't speak to whether climate research will influence hardware, but
back to the *original* theme of this thread, I actually had some data -very
*limited* data, mind you!- on how NCAR's climate model, CESM, run in an
'F2000climo' case (one of many, many cases, and very atmospheric focused)
at 2-degree atmosphere resolution (*very* coarse) on a 36-core Xeon Skylake
performs across AVX2, AVX512 and AVX512+FMA. By default, FMA is turned off
in these cases due to numerical sensitivity. So, that's a *very* specific
case, but on the off chance people are curious, here's what it looks like -
note that this is *noisy* data, because the model also does a lot of I/O,
which is why I tend to look at the median times (the last row below):
SKX (AWS C5N.18xlarge) Performance Comparison
CESM Case: F2000climo @ f19_g17 resolution
(36 cores each component / 10 model day run, skipping 1st and last)
Flags     AVX2 (no FMA)   AVX512 (no FMA)   AVX512 + FMA
Min           60.18            60.24            59.16
Max           66.26            60.47            59.40
Median        60.28            60.38            59.32
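For anyone who wants the table as relative differences, here is a minimal sketch in Python. The numbers are copied from the table above; the table gives no units, so the script treats them as unitless timings where lower is better. The names `timings`, `percent_change`, `changes` and `spreads` are just illustrative.

```python
# Timings copied from the table above (units not stated there; lower is better).
timings = {
    "AVX2 (no FMA)":   {"min": 60.18, "max": 66.26, "median": 60.28},
    "AVX512 (no FMA)": {"min": 60.24, "max": 60.47, "median": 60.38},
    "AVX512 + FMA":    {"min": 59.16, "max": 59.40, "median": 59.32},
}


def percent_change(new, baseline):
    """Relative change of `new` versus `baseline`, in percent (negative = faster)."""
    return 100.0 * (new - baseline) / baseline


baseline_median = timings["AVX2 (no FMA)"]["median"]

# Median of each build relative to the AVX2 median.
changes = {
    name: percent_change(row["median"], baseline_median)
    for name, row in timings.items()
}

# Max - min within each build, as a crude indication of run-to-run noise.
spreads = {name: row["max"] - row["min"] for name, row in timings.items()}

for name in timings:
    print(f"{name:16s} median vs AVX2: {changes[name]:+.2f}%   "
          f"max-min spread: {spreads[name]:.2f}")
```

The median differences come out at well under 2% in either direction, while the AVX2 runs alone span about 6 between their fastest and slowest measurement, so a handful of runs like this cannot separate the builds with much confidence.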
The take-away? We're not really benefiting *at all* (at this resolution,
for this compset, etc) from AVX512 here. Maybe at higher resolution?
Maybe with more vertical levels, or chemistry, or something like that?
*Maybe*, but differences seem indistinguishable from noise here, and
possibly negative! Now, give us more *memory bandwidth*, and that's
fantastic. Could this code be rewritten to take better advantage of larger
vectors? Sure, and some *really* capable people do work on that sort of
stuff, and it helps, but as an *evolution* in performance, not a revolution
(Also, I'm always horrified by presenting one-off tests as examples of
anything, but it's the only data I have on-hand! Other cases may indeed
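On the remark above that FMA is off by default because of numerical sensitivity: a fused multiply-add rounds once, after computing a*b+c exactly, whereas a separate multiply and add round twice, so the two can give different bits. The sketch below is only an illustration of that effect, not anything from CESM. It emulates the single rounding with exact rational arithmetic from Python's `fractions` module rather than calling a hardware FMA, and the function names and input values are made up for the demonstration.

```python
from fractions import Fraction


def unfused_multiply_add(a, b, c):
    """a*b is rounded to a double, then the sum is rounded again (two roundings)."""
    return a * b + c


def fused_multiply_add(a, b, c):
    """Emulated FMA: compute a*b + c exactly as a rational, then round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # int/int true division is correctly rounded


# Inputs chosen so that a*b = 1 - 2**-54 exactly, which is not a double:
# it sits halfway between two doubles and rounds to 1.0.
a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

unfused = unfused_multiply_add(a, b, c)
fused = fused_multiply_add(a, b, c)

print("unfused:", unfused)
print("fused:  ", fused)
```

With these inputs the unfused form loses the small term entirely and returns zero, while the fused form keeps it. Neither answer is "wrong" in isolation, but a model that is validated bit-for-bit, or that is chaotic enough to amplify last-bit differences, will see different results depending on whether the compiler contracts multiplies and adds.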
> Before somebody comes along with: but but but it costs! Think about how
> money is being spent simply to kill people, or at other wasteful project
> Brexit etc.
One can only hope. When it comes to spending on research, I recall the
"If you think education is expensive, try ignorance!"
On Monday, 21 June 2021, 14:46:30 BST, Joe Landman wrote:
> > On 6/21/21 9:20 AM, Jonathan Engwall wrote:
> > > I have followed this thinking "square peg, round hole."
> > > You have got it again, Joe. Compilers are your problem.
> > Erp ... did I mess up again?
> > System architecture has been a problem ... making a processing unit
> > 10-100x as fast as its support components means you have to code with
> > that in mind. A simple `gfortran -O3 mycode.f` won't necessarily
> > generate optimal code for the system ( but I swear ... -O3 ... it says
> > it on the package!)
> > Way back at Scalable, our secret sauce was largely increasing IO
> > bandwidth and lowering IO latency while coupling computing more tightly
> > to this massive IO/network pipe set, combined with intelligence in the
> > kernel on how to better use the resources. It was simply a better
> > architecture. We used the same CPUs. We simply exploited the design
> > better.
> > End result was codes that ran on our systems with off-cpu work (storage,
> > networking, etc.) could push our systems far harder than competitors.
> > And you didn't have to use a different ISA to get these benefits. No
> > recompilation needed, though we did show the folks who were interested,
> > how to get even better performance.
> > Architecture matters, as does implementation of that architecture.
> > There are costs to every decision within an architecture. For AVX512,
> > along comes lots of other baggage associated with downclocking, etc.
> > You have to do a cost-benefit analysis on whether or not it is worth
> > paying for that baggage, with the benefits you get from doing so. Some
> > folks have made that decision towards AVX512, and have been enjoying the
> > benefits of doing so (e.g. willing to pay the costs). For the general
> > audience, these costs represent a (significant) hurdle one must overcome.
> > Here's where awesome compiler support would help. FWIW, gcc isn't that
> > great a compiler. It's not performance minded for HPC. It's a reasonable
> > general purpose standards compliant (for some subset of standards)
> > compilation system. LLVM is IMO a better compiler system, and its
> > clang/flang are developing nicely, albeit still not really HPC focused.
> > Then you have variants built on that. Like the Cray compiler, Nvidia
> > compiler and AMD compiler. These are HPC focused, and actually do quite
> > well with some codes (though the AMD version lags the Cray and Nvidia
> > compilers). You've got the Intel compiler, which would be a good general
> > compiler if it wasn't more of a marketing vehicle for Intel processors
> > and their features (hey you got an AMD chip? you will take the slowest
> > code path even if you support the features needed for the high
> > performance code path).
> > Maybe, someday, we'll get a great HPC compiler for C/Fortran.