[Beowulf] AMD and AVX512
joe.landman at gmail.com
Sun Jun 20 17:21:15 UTC 2021
(Note: not disagreeing at all with Gerald, actually agreeing strongly
... also, correct address this time! Thanks Gerald!)
On 6/19/21 11:49 AM, Gerald Henriksen wrote:
> On Wed, 16 Jun 2021 13:15:40 -0400, you wrote:
>> The answer given, and I'm
>> not making this up, is that AMD listens to their users and gives the
>> users what they want, and right now they're not hearing any demand for
More accurately, there is a call for it, but from a very small segment
of the market: ones who buy small quantities of processors (under 100k
unit volume). That is, not a significant enough portion of the market
to make a huge difference to the supplier (Intel).
And more to the point, AI and HPC joining forces has put the spotlight
on small matrix multiplies, often with lower precision. I'm not sure
(haven't read much on it recently) if AVX512 will be enabling/has
enabled support for bfloat16/FP16 or similar. These tend to go to GPUs
and other accelerators.
>> Personally, I call BS on that one. I can't imagine anyone in the HPC
>> community saying "we'd like processors that offer only 1/2 the floating
>> point performance of Intel processors".
> I suspect that is marketing speak, which roughly translates to not
> that no one has asked for it, but rather requests haven't reached a
> threshold where the requests are viewed as significant enough.
This, precisely. AMD may be losing the AVX512 users to Intel. But
that's a small/minuscule fraction of the overall users of its products.
The demand for this is quite constrained. Moreover, there are often
significant performance consequences to using AVX512 (downclocking,
pipeline stalls, etc.) whereby the cost of enabling and using it far
outweighs the benefit of providing it, for the vast, overwhelming
portion of the market.
And, as noted above on the accelerator side, this use case (large
vectors) is better handled by the accelerators. There is a cost
(engineering, code design, etc.) to using accelerators as well. But it
won't directly impact the CPUs.
>> Sure, AMD can offer more cores,
>> but with only AVX2, you'd need twice as many cores as Intel processors,
>> all other things being equal.
... or you run the GPU versions of the code, which are likely getting
more active developer attention. AVX512 applies to only a minuscule
number of codes/problems. It's really not a panacea.
More to the point, have you seen how "well" compilers use AVX2/SSE
registers and do code gen? It's not pretty in general. Would you want
the compilers to purposefully spit out AVX512 code the way they do
AVX2/SSE code now? I've found one has to work very hard with intrinsics
to get good performance out of AVX2, never mind AVX512.
Put another way, we've been hearing about "smart" compilers for a while,
and in all honesty, most can barely implement a standard correctly,
never mind generate reasonably (near) optimal code for the target
system. This has been a problem my entire professional life, and while
I wish they were better, at the end of the day, this is where human
intelligence fits into the HPC/AI narrative.
> But of course all other things aren't equal.
> AVX512 is a mess.
Understated, and yes.
> Look at the Wikipedia page(*) and note that AVX512 means different
> things depending on the processor implementing it.
I made comments previously about which ISA ARM folks were going to write
to. That is, different processors, likely implementing different
instructions, differently ... you won't really have 1 equally good
compiler for all these features. You'll have a compiler that implements
common denominators reasonably well. Which mitigates the benefits of the
vendor-specific extensions.
Intel has the same problem with AVX512. I know, I know ... feature
flags on the CPU (see the last line of lscpu output). And how often have
certain (ahem) compilers ignored the flags, and used a different
mechanism to determine CPU feature support, specifically targeting their
competitors' offerings to force (literally) low-performance code paths
on them?
> So what does the poor software developer target?
Lowest common denominator. Make the code work correctly first. Then
make it fast. If fast is platform specific, ask how often that
platform will be used.
> Or that it can for heat reasons cause CPU frequency reductions,
> meaning real world performance may not match theoretical - thus easier
> to just go with GPU's.
> The result is that most of the world is quite happily (at least for
> now) ignoring AVX512 and going with GPU's as necessary - particularly
> given the convenient libraries that Nvidia offers.
Yeah ... like it or not, that battle is over (for now).
>> An argument can be made that for calculations that lend themselves to
>> vectorization should be done on GPUs, instead of the main processors but
>> the last time I checked, GPU jobs are still memory is limited, and
>> moving data in and out of GPU memory can still take time, so I can see
>> situations where for large amounts of data using CPUs would be preferred
>> over GPUs.
> AMD's latest chips support PCI 4 while Intel is still stuck on PCI 3,
> which may or may not mean a difference.
It does. IO and memory bandwidth/latency are very important, and oft
overlooked aspects of performance. If you have a choice of doubling IO
and memory bandwidth at lower latency (usable by everyone) vs adding an
AVX512 unit or two (usable by a small fraction of a percent of all
users), which would net you, as an architect, the best "bang for the buck"?
> But what despite all of the above and the other replies, it is AMD who
> has been winning the HPC contracts of late, not Intel.
There's a reason for that. I will admit I have a devil of a time trying
to convince people that higher clock frequency for computing matters
only to a small fraction of operations, especially ones waiting on
(slow) RAM and (slower) IO. Make the RAM and IO faster (lower latency,
higher bandwidth), and the system will be far more performant.
e: joe.landman at gmail.com