[Beowulf] [External] anyone have modern interconnect metrics?
Prentice Bisbal
pbisbal at pppl.gov
Mon Jan 22 16:54:33 UTC 2024
On 1/22/24 11:38 AM, Scott Atchley wrote:
> On Mon, Jan 22, 2024 at 11:16 AM Prentice Bisbal <pbisbal at pppl.gov> wrote:
>
>> <snip>
>>
>> > Another interesting topic is that nodes are becoming many-core -
>> > any thoughts?
>>
>> Core counts are getting too high to be of use in HPC. High core-count
>> processors sound great until you realize that all those cores are now
>> competing for the same memory bandwidth and network bandwidth, neither
>> of which increases with core count.
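To put rough numbers on that contention: with socket bandwidth fixed, per-core bandwidth falls in inverse proportion to core count. A minimal sketch, assuming a hypothetical 12-channel DDR5-4800 socket (an illustrative configuration, not a specific product):

```python
# A rough sketch, assuming a 12-channel DDR5-4800 socket (an assumed
# configuration, not a specific product): total socket bandwidth is
# fixed, so per-core bandwidth shrinks as core count rises.
channels = 12
transfers_per_s = 4800e6      # DDR5-4800: 4.8 GT/s per channel
bytes_per_transfer = 8        # 64-bit channel width
total_bw = channels * transfers_per_s * bytes_per_transfer  # bytes/s

for cores in (16, 32, 64, 96, 128):
    per_core = total_bw / cores
    print(f"{cores:3d} cores: {per_core / 1e9:5.2f} GB/s per core")
# 16 cores get 28.8 GB/s each; 128 cores get only 3.6 GB/s each.
```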
>>
>> Last April we were evaluating test systems from different vendors for
>> a cluster purchase. One of our test users does a lot of CFD
>> simulations that are very sensitive to memory bandwidth. While he was
>> getting a 50% speedup on AMD compared to Intel (which makes sense,
>> since AMD requires 12 DIMM slots to be filled instead of Intel's 8),
>> he asked us to consider servers with FEWER cores. Even on the AMDs,
>> he was saturating the memory bandwidth before scaling to all the
>> cores, causing his performance to plateau. Buying cheaper processors
>> with lower core counts was better for him, since the savings would
>> allow us to buy additional nodes, which would benefit him more.
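The budget trade-off can be made concrete. All prices and the per-node bandwidth below are invented for illustration; only the shape of the argument comes from the post:

```python
# Invented numbers to illustrate the trade-off described above. For a
# bandwidth-bound code, each node contributes roughly one socket's worth
# of bandwidth regardless of core count, so for a fixed budget more
# cheap nodes means more aggregate throughput.
budget = 500_000          # hypothetical cluster budget, USD
node_bw_gbs = 460.8       # assumed per-node memory bandwidth, GB/s

node_cost = {"96-core": 25_000, "48-core": 18_000}  # hypothetical prices
for name, cost in node_cost.items():
    nodes = budget // cost
    print(f"{name}: {nodes} nodes, {nodes * node_bw_gbs:,.0f} GB/s aggregate")
```

With these made-up prices, the lower-core-count part buys 27 nodes instead of 20, and thus roughly a third more aggregate bandwidth for the same spend.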
>>
>>
>> We see this as well in DOE, especially when GPUs are doing a
>> significant amount of the work.
>
> Yeah, I noticed that Frontier and Aurora will actually be
> single-socket systems w/ "only" 64 cores.
>
> Yes, Frontier is a *single* *CPU* socket and *four GPUs* (actually
> eight GPUs from the user's perspective). It works out to eight cores
> per Graphics Compute Die (GCD). The FLOPS ratio is roughly 1:100
> between the CPU and GPUs.
>
> Note, Aurora is a dual CPU and six GPU. I am not sure if the user sees
> six or more GPUs. The Aurora node is similar to our Summit node but
> with more connectivity between the GPUs.
Thanks for clarifying! I thought it was a single-CPU system like
Frontier. Not only is the FLOPS ratio much higher on GPUs, so is the
FLOPS/W ratio. Even though CPUs have gotten much more efficient lately,
their progress looks practically stagnant compared to GPU-based
clusters, based on my analysis of Top500 and Green500 trends.
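For reference, the Frontier node arithmetic Scott describes checks out directly. The peak-FLOPS figures below are approximate public numbers, not from this thread:

```python
# Checking the node arithmetic: one 64-core CPU, four MI250X GPUs, each
# exposing two Graphics Compute Dies (GCDs), so the user sees eight GPUs.
cpu_cores = 64
gcds = 4 * 2                      # 4 GPUs x 2 GCDs each
print(cpu_cores // gcds, "cores per GCD")   # prints: 8 cores per GCD

# Peak-FP64 figures are approximate public numbers, not from this thread.
gpu_tflops = 4 * 47.9             # MI250X vector FP64 peak, x4 GPUs
cpu_tflops = 2.0                  # rough peak for the 64-core CPU
print(f"FLOPS ratio ~1:{gpu_tflops / cpu_tflops:.0f}")  # ~1:96, i.e. roughly 1:100
```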
Prentice