[Beowulf] [External] Re: AMD and AVX512
Prentice Bisbal
pbisbal at pppl.gov
Wed Jun 16 20:39:39 UTC 2021
Scott (and Michael and Carlos),
Thanks for your excellent feedback. That's the kind of enlightening
feedback I was looking for. Interesting that the HBM on Fugaku exceeds
the needs of the processor.
Prentice
On 6/16/21 2:23 PM, Scott Atchley wrote:
> On Wed, Jun 16, 2021 at 1:15 PM Prentice Bisbal via Beowulf
> <beowulf at beowulf.org <mailto:beowulf at beowulf.org>> wrote:
>
> Did anyone else attend this webinar panel discussion with AMD
> hosted by
> HPCWire yesterday? It was titled "AMD HPC Solutions: Enabling Your
> Success in HPC"
>
> https://www.hpcwire.com/amd-hpc-solutions-enabling-your-success-in-hpc/
> <https://www.hpcwire.com/amd-hpc-solutions-enabling-your-success-in-hpc/>
>
> I attended it, and noticed there was no mention of AMD supporting
> AVX512, so during the question and answer portion of the program, I
> asked when AMD processors will support AVX512. The answer given,
> and I'm
> not making this up, is that AMD listens to their users and gives the
> users what they want, and right now they're not hearing any demand
> for
> AVX512.
>
> Personally, I call BS on that one. I can't imagine anyone in the HPC
> community saying "we'd like processors that offer only 1/2 the
> floating
> point performance of Intel processors". Sure, AMD can offer more
> cores,
> but with only AVX2, you'd need twice as many cores as Intel
> processors,
> all other things being equal.
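The "twice as many cores" claim follows directly from vector width: an AVX-512 register holds 8 doubles versus 4 for AVX2, so at equal clocks and equal FMA-unit counts the per-core peak doubles. A minimal sketch of that arithmetic (the clock speed and two-FMA-unit configuration below are illustrative assumptions, not measurements of any specific SKU):

```python
# Theoretical peak double-precision GFLOPS per core, AVX2 vs AVX-512.
# Clock and FMA-unit count are illustrative assumptions, not SKU data.

def peak_gflops_per_core(simd_bits, fma_units, ghz):
    """Peak DP GFLOPS: lanes * 2 FLOPs per FMA * FMA units * clock."""
    lanes = simd_bits // 64  # 64-bit doubles per vector register
    return lanes * 2 * fma_units * ghz

avx2 = peak_gflops_per_core(256, 2, 2.5)    # 4 lanes -> 40 GFLOPS
avx512 = peak_gflops_per_core(512, 2, 2.5)  # 8 lanes -> 80 GFLOPS
print(avx2, avx512)
```

With everything else held equal, the AVX-512 core has exactly twice the peak, which is where the "need twice as many AVX2 cores" figure comes from.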
>
> Last fall I evaluated potential new cluster nodes for a large cluster
> purchase using the HPL benchmark. I compared a server with dual
> AMD EPYC
> 7H12 processors (128 cores) to a server with quad Intel Xeon 8268
> processors (96 cores). I measured 5,389 GFLOPS for the Xeon 8268, and
> only 3,446 GFLOPS for the AMD 7H12. That's a LINPACK score that is
> only 64% of the Xeon 8268 system's, despite the AMD system having
> 33% more cores.
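The two percentages quoted follow from the measured numbers above; a quick sanity check:

```python
# Sanity-check the ratios from the measured HPL numbers quoted above.
xeon_gflops, epyc_gflops = 5389.0, 3446.0
xeon_cores, epyc_cores = 96, 128

pct_of_xeon = round(100 * epyc_gflops / xeon_gflops)      # -> 64
pct_more_cores = round(100 * (epyc_cores / xeon_cores - 1))  # -> 33
print(pct_of_xeon, pct_more_cores)
```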
>
> From what I've heard, the AMD processors run much hotter than the
> Intel
> processors, too, so I imagine a FLOPS/Watt comparison would be
> even less
> favorable to AMD.
>
> An argument can be made that calculations that lend themselves to
> vectorization should be done on GPUs instead of the main processors,
> but the last time I checked, GPU jobs are still memory limited, and
> moving data in and out of GPU memory can still take time, so I can
> see situations where, for large amounts of data, using CPUs would be
> preferred over GPUs.
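The CPU-vs-GPU trade-off described above can be put in rough numbers: if a kernel does little work per byte, the time to push the data over PCIe can exceed the GPU's compute time. A back-of-envelope sketch, where the PCIe bandwidth and GPU FLOP rate are illustrative assumptions (roughly PCIe 3.0 x16 and a contemporary compute GPU):

```python
# Back-of-envelope: when does host<->GPU data movement dominate?
# Bandwidth and FLOP-rate figures are rough assumptions for illustration.

def transfer_vs_compute(bytes_moved, flops, pcie_bs=16e9, gpu_flops=7e12):
    t_xfer = bytes_moved / pcie_bs  # seconds to move data over PCIe
    t_comp = flops / gpu_flops      # seconds to compute on the GPU
    return t_xfer, t_comp

# 8 GB of data, touched once, with ~10 FLOPs of work per byte:
t_xfer, t_comp = transfer_vs_compute(8e9, 8e9 * 10)
print(t_xfer, t_comp)  # transfer time dwarfs compute time here
```

Under these assumptions the transfer alone takes roughly 0.5 s while the compute takes about 0.01 s, which is exactly the "large data, low reuse" regime where staying on the CPU can win.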
>
> Your thoughts?
>
> --
> Prentice
>
>
> AMD has studied this quite a bit in DOE's FastForward-2 and
> PathForward. I think Carlos' comment is on track. Having a unit that
> cannot be fed data quickly enough is pointless. It is application
> dependent. If your working set fits in cache, then the vector units
> work well. If not, you have to move data which stalls compute
> pipelines. NERSC saw only a 10% increase in performance when moving
> from low core count Xeon CPUs with AVX2 to Knights Landing with many
> cores and AVX-512 when it should have seen an order of magnitude
> increase. Although Knights Landing had MCDRAM (Micron's not-quite
> HBM), other constraints limited performance (e.g., lack of enough
> memory references in flight, coherence traffic).
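Scott's "cannot be fed data quickly enough" point is the machine-balance argument: a streaming kernel needs a certain number of bytes per FLOP, and if the memory system cannot supply them, the wide vector units idle. A sketch of that calculation, with illustrative numbers (a many-core AVX-512 node's peak and DDR4-class bandwidth are assumptions, and 12 bytes/FLOP corresponds to a triad-like kernel: 2 loads and 1 store of doubles per 2 FLOPs):

```python
# Machine balance sketch: fraction of peak a bandwidth-bound kernel can
# sustain. All figures are illustrative assumptions, not measurements.

def sustained_fraction_of_peak(peak_gflops, mem_bw_gbs, bytes_per_flop):
    bw_limited_gflops = mem_bw_gbs / bytes_per_flop  # BW-capped FLOP rate
    return min(1.0, bw_limited_gflops / peak_gflops)

# Triad-like streaming code (~12 bytes/FLOP) on a node with ~3 TFLOPS
# DP peak fed by ~100 GB/s of DRAM bandwidth:
frac = sustained_fraction_of_peak(3000, 100, 12)
print(frac)  # well under 1% of peak: the vector units mostly wait
```

This is why NERSC's observed 10% gain, rather than the order-of-magnitude one the FLOP ratings suggest, is unsurprising for bandwidth-bound codes.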
>
> Fujitsu's ARM64 chip with 512b SVE in Fugaku does much better than
> Xeon with AVX-512 (or Knights Landing) because of the High Bandwidth
> Memory (HBM) attached and I assume a larger number of memory
> references in flight. The downside is the lack of memory capacity
> (only 32 GB per node). This shows that it is possible to get more
> performance with a CPU with a 512b vector engine. That said, it is not
> clear that even this CPU design can extract the most from the memory
> bandwidth. If you look at the increase in memory bandwidth from Summit
> to Fugaku, one would expect performance on real apps to increase by
> that amount as well. From the presentations that I have seen, that is
> not always the case. For some apps, the GPU architecture, with its
> coherence on demand rather than with every operation, can extract more
> performance.
>
> AMD will add 512b vectors if/when it makes sense on real apps.