[Beowulf] [External] anyone have modern interconnect metrics?

Prentice Bisbal pbisbal at pppl.gov
Mon Jan 22 16:54:33 UTC 2024


On 1/22/24 11:38 AM, Scott Atchley wrote:
> On Mon, Jan 22, 2024 at 11:16 AM Prentice Bisbal <pbisbal at pppl.gov> wrote:
>
>>     <snip>
>>
>>         > Another interesting topic is that nodes are becoming
>>         > many-core - any thoughts?
>>
>>         Core counts are getting too high to be of use in HPC. High
>>         core-count processors sound great until you realize that all
>>         those cores are now competing for the same memory bandwidth
>>         and network bandwidth, neither of which increases with
>>         core count.
>>
>>         Last April we were evaluating test systems from different
>>         vendors for a cluster purchase. One of our test users does a
>>         lot of CFD simulations that are very sensitive to memory
>>         bandwidth. While he was getting a 50% speedup on AMD compared
>>         to Intel (which makes sense, since the AMDs require 12 DIMM
>>         slots to be filled instead of Intel's 8), he asked us to
>>         consider servers with FEWER cores. Even on the AMDs, he was
>>         saturating the memory bandwidth before scaling to all the
>>         cores, causing his performance to plateau. Buying cheaper
>>         processors with lower core counts was better for him, since
>>         the savings would allow us to buy additional nodes, which
>>         would be more beneficial to him.
>>
>>
>>     We see this as well in DOE, especially when GPUs are doing a
>>     significant amount of the work.
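
For what it's worth, here is a back-of-envelope roofline-style sketch in
Python of the plateau described above. Every number in it is an
illustrative assumption (per-core peak, socket bandwidth, arithmetic
intensity), not a measurement of any particular CPU:

# Attainable FLOP/s vs. active cores when all cores share one fixed
# pool of memory bandwidth. All numbers are illustrative assumptions.

PEAK_FLOPS_PER_CORE = 50e9  # assumed per-core peak, FLOP/s
SOCKET_BW = 400e9           # assumed socket memory bandwidth, bytes/s
INTENSITY = 0.25            # FLOPs/byte; CFD stencils are often this low

def attainable_flops(cores):
    """Attainable FLOP/s is the lesser of the compute peak and what the
    shared memory bandwidth can feed at this arithmetic intensity."""
    compute_bound = cores * PEAK_FLOPS_PER_CORE
    bandwidth_bound = SOCKET_BW * INTENSITY  # does NOT grow with cores
    return min(compute_bound, bandwidth_bound)

for cores in (2, 4, 8, 16, 32, 64, 96):
    print(f"{cores:3d} cores: {attainable_flops(cores)/1e9:8.1f} GFLOP/s")

With an intensity that low, the output plateaus at 100 GFLOP/s once two
cores saturate the bandwidth; every core past that adds cost but no
throughput, which is why fewer, cheaper cores plus more nodes (and
therefore more aggregate bandwidth) won for this user.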
>
>     Yeah, I noticed that Frontier and Aurora will actually be
>     single-socket systems w/ "only" 64 cores.
>
>  Yes, Frontier is a *single* *CPU* socket and *four GPUs* (actually 
> eight GPUs from the user's perspective). It works out to eight cores 
> per Graphics Compute Die (GCD). The FLOPS ratio is roughly 1:100 
> between the CPU and GPUs.
>
> Note, Aurora is a dual-CPU, six-GPU node. I am not sure if the user 
> sees six or more GPUs. The Aurora node is similar to our Summit node 
> but with more connectivity between the GPUs.
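
As a quick sanity check on that ratio (the CPU clock and per-cycle FLOP
figures below are my assumptions; the MI250X peak is the commonly quoted
spec):

# Rough check of the ~1:100 CPU:GPU FP64 FLOPS ratio on a Frontier node.
# CPU: one 64-core AMD "Trento" EPYC; assume ~2 GHz and 16 FP64
# FLOPs/cycle/core (two 256-bit FMA pipes: 4 lanes x 2 ops x 2 pipes).
cpu_flops = 64 * 2.0e9 * 16              # ~2.0 TFLOP/s FP64

# GPUs: four MI250X cards, each quoted at ~47.9 TFLOP/s FP64 vector;
# two GCDs per card is the "eight GPUs from the user's perspective".
gpu_flops = 4 * 47.9e12                  # ~191.6 TFLOP/s FP64

print(f"cores per GCD: {64 / (4 * 2):.0f}")             # 8
print(f"CPU:GPU ratio: 1:{gpu_flops / cpu_flops:.0f}")  # ~1:94

which lands close to the roughly 1:100 figure above.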

Thanks for clarifying! I thought it was a single-CPU system like 
Frontier. Not only is the FLOPS ratio much higher on GPUs, so is the 
FLOPS/W ratio. Even though CPUs have gotten much more efficient lately, 
their efficiency is practically stagnant compared to that of GPU-based 
clusters, based on my analysis of Top500 and Green500 trends.
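
(For the curious, that comparison boils down to GFLOPS/W from each
list's Rmax and power columns. A minimal sketch, where the file name and
column names are hypothetical stand-ins for however you export the
lists:

import csv

def gflops_per_watt(rmax_tflops, power_kw):
    """Green500-style efficiency: Rmax in GFLOPS over power in watts,
    which reduces to TFLOPS/kW."""
    return (rmax_tflops * 1e3) / (power_kw * 1e3)

with open("top500_sample.csv", newline="") as f:  # hypothetical export
    for row in csv.DictReader(f):
        eff = gflops_per_watt(float(row["rmax_tflops"]),
                              float(row["power_kw"]))
        print(f'{row["system"]:30s} {eff:7.2f} GFLOPS/W')

Run that across successive lists and you see the stagnation I mean: the
CPU-only systems barely move while the GPU systems keep climbing.)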

Prentice

