[Beowulf] AMD and AVX512

Sun Jun 20 17:21:15 UTC 2021

(Note:  not disagreeing at all with Gerald, actually agreeing strongly 
... also, correct address this time!  Thanks Gerald!)

On 6/19/21 11:49 AM, Gerald Henriksen wrote:
> On Wed, 16 Jun 2021 13:15:40 -0400, you wrote:
>
>> The answer given, and I'm
>> not making this up, is that AMD listens to their users and gives the
>> users what they want, and right now they're not hearing any demand for
>> AVX512.

More accurately, there is a call for it, but from a very small segment of 
the market: the ones who buy small quantities of processors (under 100k 
volume per purchase).

That is, not a significant enough portion of the market to make a huge 
difference to the supplier (Intel).

And more to the point, AI and HPC joining forces has put the spotlight 
on small matrix multiplies, often with lower precision.  I'm not sure 
(haven't read much on it recently) if AVX512 will be enabling/has 
enabled support for bfloat16/FP16 or similar.  These tend to go to GPUs 
and other accelerators.

>> Personally, I call BS on that one. I can't imagine anyone in the HPC
>> community saying "we'd like processors that offer only 1/2 the floating
>> point performance of Intel processors".
> I suspect that is marketing speak, which roughly translates to not
> that no one has asked for it, but rather requests haven't reached a
> threshold where the requests are viewed as significant enough.

This, precisely.  AMD may be losing the AVX512 users to Intel.  But 
that's a minuscule fraction of the overall users of its products.  
The demand for this is quite constrained.  Moreover, there are often 
significant performance consequences to using AVX512 (downclocking, 
pipeline stalls, etc.), so the cost of enabling and using it far 
outweighs the benefit of providing it for the vast, overwhelming 
portion of the market.

And, as noted above on the accelerator side, this use case (large 
vectors) is better handled by the accelerators.  There is a cost 
(engineering, code design, etc.) to using accelerators as well.  But it 
won't directly impact the CPUs.

>> Sure, AMD can offer more cores,
>> but with only AVX2, you'd need twice as many cores as Intel processors,
>> all other things being equal.

... or you run the GPU versions of the code, which are likely getting 
more active developer attention.  AVX512 applies to only a minuscule 
number of codes/problems.  It's really not a panacea.
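
As a back-of-envelope illustration of the "twice as many cores" 
arithmetic and of how downclocking eats into it, here is a minimal 
sketch.  The core counts, clocks, and FMA unit counts are made-up 
round numbers (assumptions, not measurements of any real part), and 
peak_gflops is a hypothetical helper name.

```python
# Back-of-envelope peak FP64 throughput.  All inputs are hypothetical.

def peak_gflops(cores, ghz, vector_bits, fma_units_per_core):
    """Theoretical peak double-precision GFLOP/s for one socket."""
    lanes = vector_bits // 64                          # FP64 values per vector
    flops_per_cycle = lanes * 2 * fma_units_per_core   # one FMA = 2 flops
    return cores * ghz * flops_per_cycle

# Same (assumed) core count and FMA units, differing only in vector width.
avx2 = peak_gflops(cores=32, ghz=3.0, vector_bits=256, fma_units_per_core=2)
avx512_nominal = peak_gflops(cores=32, ghz=3.0, vector_bits=512, fma_units_per_core=2)

# Assume, purely for illustration, a drop to 2.4 GHz under heavy AVX512 load.
avx512_downclocked = peak_gflops(cores=32, ghz=2.4, vector_bits=512, fma_units_per_core=2)

print("AVX2 peak:              ", avx2)
print("AVX512 peak, nominal:   ", avx512_nominal)
print("AVX512 peak, downclocked:", avx512_downclocked)
print("downclocked AVX512 / AVX2:", avx512_downclocked / avx2)
```

This is peak arithmetic only.  It says nothing about whether a given 
code can keep the vector units fed, which is the harder problem.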

More to the point, have you seen how "well" compilers use AVX2/SSE 
registers and do code gen?  It's not pretty in general.  Would you want 
the compilers to purposefully spit out AVX512 code the way they do 
AVX2/SSE code now?  I've found one has to work very hard with intrinsics 
to get good performance out of AVX2, never mind AVX512.

Put another way, we've been hearing about "smart" compilers for a while, 
and in all honesty, most can barely implement a standard correctly, 
never mind generate reasonably (near) optimal code for the target 
system.  This has been a problem my entire professional life, and while 
I wish they were better, at the end of the day, this is where human 
intelligence fits into the HPC/AI narrative.

> But of course all other things aren't equal.
>
> AVX512 is a mess.

Understated, and yes.

> Look at the Wikipedia page(*) and note that AVX512 means different
> things depending on the processor implementing it.

I made comments previously about which ISA ARM folks were going to write 
to.  That is, different processors, likely implementing different 
instructions, differently ... you won't really have 1 equally good 
compiler for all these features.  You'll have a compiler that implements 
common denominators reasonably well, which diminishes the benefits of the 
ISA/architecture.

Intel has the same problem with AVX512.  I know, I know ... feature 
flags on the CPU (see last line of lscpu output).  And how often have 
certain (ahem) compilers ignored the flags, and used a different 
mechanism to determine CPU feature support, specifically targeting their 
competitor offerings to force (literally) low performance paths for 
those CPUs?
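
For anyone who wants to see which AVX512 subsets a given box 
advertises, here is a minimal sketch.  It assumes a Linux x86 system, 
where /proc/cpuinfo carries a "flags" line (the same list lscpu 
prints).  The two sample flag lines are hypothetical, not captured 
from real hardware, and the helper names are made up for illustration.

```python
def avx512_features(cpuinfo_text):
    """Return the sorted avx512* flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return sorted(f for f in value.split() if f.startswith("avx512"))
    return []  # no "flags" line found (non-x86, non-Linux, or empty input)

def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the kernel's CPU description; empty string if unavailable."""
    try:
        with open(path) as fh:
            return fh.read()
    except OSError:
        return ""

# Hypothetical flag lines for two different AVX512-capable processors.
sample_a = "flags\t\t: fpu sse2 avx avx2 avx512f avx512dq avx512cd avx512bw avx512vl"
sample_b = "flags\t\t: fpu sse2 avx avx2 avx512f avx512cd avx512er avx512pf"

# The subset a portable binary could rely on across both machines.
common = set(avx512_features(sample_a)) & set(avx512_features(sample_b))
print("common AVX512 subsets:", sorted(common))
print("this machine:", avx512_features(read_cpuinfo()))
```

The intersection is the point: the more implementations you have to 
cover, the smaller the set of instructions you can count on.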

> So what does the poor software developer target?

Lowest common denominator.  Make the code work correctly first.  Then 
make it fast.  If fast is platform specific, ask how often that 
platform will be used.

> Or that it can for heat reasons cause CPU frequency reductions,
> meaning real world performance may not match theoretical - thus easier
> to just go with GPU's.
>
> The result is that most of the world is quite happily (at least for
> now) ignoring AVX512 and going with GPU's as necessary - particularly
> given the convenient libraries that Nvidia offers.

Yeah ... like it or not, that battle is over (for now).

[...]

>
>> An argument can be made that calculations that lend themselves to
>> vectorization should be done on GPUs instead of the main processors, but
>> the last time I checked, GPU jobs are still memory limited, and
>> moving data in and out of GPU memory can still take time, so I can see
>> situations where for large amounts of data using CPUs would be preferred
>> over GPUs.
> AMD's latest chips support PCI 4 while Intel is still stuck on PCI 3,
> which may or may not mean a difference.

It does.  IO and memory bandwidth/latency are very important, and oft 
overlooked aspects of performance.  If you have a choice of doubling IO 
and memory bandwidth at lower latency (usable by everyone) vs adding an 
AVX512 unit or two (usable by a small fraction of a percent of all 
users), which would net you, as an architect, the best "bang for the buck"?
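
To put rough numbers on the transfer cost, here is a minimal sketch 
using the nominal PCIe 3.0 and 4.0 signalling rates (8 and 16 GT/s per 
lane with 128b/130b encoding).  Real sustained throughput is lower.  
The job timings and data size are hypothetical, and offload_wins is a 
made-up helper for illustration.

```python
# Nominal per-lane throughput in GB/s: GT/s * (128/130 encoding) / 8 bits.
PCIE_GB_PER_S_PER_LANE = {
    3: 8 * 128 / 130 / 8,
    4: 16 * 128 / 130 / 8,
}

def transfer_seconds(gigabytes, gen, lanes=16):
    """Best-case time to move `gigabytes` across a PCIe link."""
    return gigabytes / (PCIE_GB_PER_S_PER_LANE[gen] * lanes)

def offload_wins(cpu_seconds, gpu_seconds, gigabytes_moved, gen):
    """True if GPU compute plus transfer beats staying on the CPU."""
    return gpu_seconds + transfer_seconds(gigabytes_moved, gen) < cpu_seconds

# Hypothetical job: 10 s on the CPU, 7 s on the GPU, 64 GB moved in total.
for gen in (3, 4):
    t = transfer_seconds(64, gen)
    print(f"PCIe {gen}.0 x16: transfer {t:.2f} s, "
          f"offload wins: {offload_wins(10.0, 7.0, 64, gen)}")
```

With these assumed numbers the same job flips from "stay on the CPU" 
to "offload" purely because of the link, with no change in compute 
capability on either side.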

> But despite all of the above and the other replies, it is AMD who
> has been winning the HPC contracts of late, not Intel.

There's a reason for that.  I will admit I have a devil of a time trying 
to convince people that higher clock frequency for computing matters 
only to a small fraction of operations, especially ones waiting on 
(slow) RAM and (slower) IO.  Make the RAM and IO faster (lower latency, 
higher bandwidth), and the system will be far more performant.
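
One common way to reason about this is a roofline-style bound: 
attainable throughput is the smaller of peak compute and memory 
bandwidth times arithmetic intensity.  A minimal sketch follows.  The 
peak, bandwidth, and intensity values are hypothetical round numbers, 
and this simple model ignores latency and caches entirely.

```python
def attainable_gflops(peak, mem_gb_per_s, flops_per_byte):
    """Roofline-style bound: min(compute roof, memory roof)."""
    return min(peak, mem_gb_per_s * flops_per_byte)

# Hypothetical machine and a bandwidth-bound kernel (0.25 flops per byte).
peak, bandwidth, intensity = 1500.0, 200.0, 0.25

baseline = attainable_gflops(peak, bandwidth, intensity)
double_compute = attainable_gflops(2 * peak, bandwidth, intensity)
double_bandwidth = attainable_gflops(peak, 2 * bandwidth, intensity)

print("baseline:              ", baseline)
print("2x peak compute:       ", double_compute)
print("2x memory bandwidth:   ", double_bandwidth)
```

For a kernel this far below the compute roof, doubling the vector 
width changes nothing in the bound, while doubling bandwidth doubles it.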

-- 

Joe Landman
e:joe.landman at gmail.com
t: @hpcjoe
w:https://scalability.org
g:https://github.com/joelandman
l:https://www.linkedin.com/in/joelandman
