[Beowulf] [External] anyone have modern interconnect metrics?

Lux, Jim (US 3370) james.p.lux at jpl.nasa.gov
Fri Feb 16 18:58:49 UTC 2024


Is web/cloud traffic actually bursty when considered at an aggregated, multithreaded scale?
Sure, if you're running a single thread on a single CPU, there's dead time between handling requests, feeding data out, and whatever backend processing is done. But if you're handling hundreds of interleaved sessions, I would think the CPU and memory-access load would be pretty constant: while you're waiting for an external disk access to complete, you're handling another request, and so on.

Or is the server software so clunkily written that it can't do this? I can imagine that - it might be cheaper to throw hardware at it than to try to squeeze every last cycle out of the CPU, even at a scale of millions of CPUs. It sure would be easier to debug.
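To make the interleaving concrete, here is a minimal sketch of an event-loop server (a toy service, not any real web stack): while one handler is parked waiting on I/O, the loop runs the other sessions, which is why the aggregate CPU load flattens out.

import asyncio

# Toy event-loop server: while one handler awaits I/O, the loop serves
# the other sessions, so per-request dead time gets filled with other work.
async def handle(reader, writer):
    data = await reader.read(4096)          # dead time: waiting on the client / backend I/O
    result = sum(data)                      # stand-in for the CPU work done per request
    writer.write(f"{result}\n".encode())    # feed the response back out
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main():
    # arbitrary local address for the sketch
    server = await asyncio.start_server(handle, "127.0.0.1", 8888)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())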


On 1/24/24, 11:40 AM, "Beowulf on behalf of Douglas Eadline" <beowulf-bounces at beowulf.org on behalf of deadline at eadline.org> wrote:




--snip--


> Core counts are getting too high to be of use in HPC. High core-count
> processors sound great until you realize that all those cores are now
> competing for the same memory bandwidth and network bandwidth, neither of
> which increases with core count.
>
> Last April we were evaluating test systems from different vendors for a
> cluster purchase. One of our test users does a lot of CFD simulations
> that are very sensitive to memory bandwidth. While he was getting a 50%
> speedup on AMD compared to Intel (which makes sense, since the AMDs require
> 12 DIMM slots to be filled instead of Intel's 8), he asked us to consider
> servers with FEWER cores. Even with the AMDs, he was saturating the
> memory bandwidth before scaling to all the cores, causing his
> performance to plateau. Buying cheaper processors with lower
> core counts was better for him, since the savings would allow us to buy
> additional nodes, which would be more beneficial to him.
>
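That plateau falls out of simple arithmetic: once the aggregate per-core demand reaches the socket's memory bandwidth, additional cores contribute nothing. A back-of-envelope sketch (every number below is an illustrative assumption, not a figure from that evaluation):

# Back-of-envelope model of the bandwidth plateau described above.
# All values are assumed, illustrative numbers.
node_bw_gbs = 460.0        # assumed peak memory bandwidth per socket (e.g. 12 channels of DDR5-4800)
per_core_demand_gbs = 8.0  # assumed sustained bandwidth one CFD rank wants
cores = 96                 # assumed core count of the high-core-count part

cores_at_saturation = node_bw_gbs / per_core_demand_gbs
print(f"bandwidth saturates near {cores_at_saturation:.0f} of {cores} cores; "
      f"the remaining ~{cores - cores_at_saturation:.0f} add little")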


So it does depend on the application <ducks>


Also, server processors are mainly designed for cloud use (their
biggest customers), which means large numbers of cores. Besides
memory BW there is also clock speed. Speeds are based on thermals,
which are based on how busy the CPUs are. For webby/cloud
bursty loads this works okay and you can hit "turbo speeds",
but load the system with one HPC process per core
and you are now running at base frequency with
some crappy memory BW.
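A quick way to see this on a node (assuming a Linux system with the cpufreq sysfs interface exposed) is to sample the per-core frequencies once while idle and once with one HPC rank pinned per core, then compare against the part's base and turbo clocks:

import glob

# Read the current frequency of every core from the Linux cpufreq sysfs
# interface (values are reported in kHz; convert to MHz).
def core_freqs_mhz():
    freqs = {}
    for path in sorted(glob.glob(
            "/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq")):
        core = path.split("/")[5]          # e.g. "cpu17"
        with open(path) as f:
            freqs[core] = int(f.read()) / 1000.0
    return freqs

if __name__ == "__main__":
    f = core_freqs_mhz()
    print(f"{len(f)} cores: min {min(f.values()):.0f} MHz, "
          f"max {max(f.values()):.0f} MHz, "
          f"mean {sum(f.values()) / len(f):.0f} MHz")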


--
Doug






>>
>> Alternatively, are there other places to ask? Reddit or something less
>> "greybeard"?
>
> I've been very disappointed with the "expertise" on the HPC-related
> subreddits. Last time I lurked there, it seemed very amateurish/DIY
> oriented. For example, someone wanted to buy all the individual
> components and build and assemble their own nodes for an entire cluster at
> their job. Can you imagine? Most of the replies were encouraging them to
> do so....
>
> You might want to join the HPCSYSPROS Slack channel and ask there. HPC
> SYSPROS is an ACM SIG for HPC system admins that runs workshops every
> year at SC. Click on the "Get Involved" link on this page:
>
> https://sighpc-syspros.org/
>
> --
> Prentice
>
>
> On 1/16/24 5:19 PM, Mark Hahn wrote:
>> Hi all,
>> Just wondering if any of you have numbers (or experience) with
>> modern high-speed COTS ethernet.
>>
>> Latency mainly, but perhaps also message rate. Also ease of use
>> with open-source products like OpenMPI, maybe Lustre?
>> Flexibility in configuring clusters in the >= 1k node range?
>>
>> We have a good idea of what to expect from Infiniband offerings,
>> and are familiar with scalable network topologies.
>> But vendors seem to think that high-end ethernet (100-400Gb) is
>> competitive...
>>
>> For instance, here's an excellent study of Cray/HP Slingshot (non-COTS):
>> https://arxiv.org/pdf/2008.08886.pdf
>> (half rtt around 2 us, but this paper has great stuff about
>> congestion, etc)
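For a first latency number on any of these fabrics, a small ping-pong in the spirit of osu_latency works over OpenMPI; this sketch assumes mpi4py and NumPy are installed, and it reports small-message half-round-trip latency between two ranks (place them on different nodes to measure the network rather than shared memory):

# Minimal two-rank ping-pong latency sketch (run: mpirun -np 2 python pingpong.py)
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
reps = 10000
buf = np.zeros(8, dtype=np.uint8)   # small message, so timing is latency-dominated

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(reps):
    if rank == 0:
        comm.Send(buf, dest=1, tag=0)
        comm.Recv(buf, source=1, tag=0)
    else:
        comm.Recv(buf, source=0, tag=0)
        comm.Send(buf, dest=0, tag=0)
t1 = MPI.Wtime()

if rank == 0:
    # Each iteration is a full round trip; report half of it in microseconds.
    print(f"half-RTT latency: {(t1 - t0) / reps / 2 * 1e6:.2f} us")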
>>
>> Yes, someone is sure to say "don't try characterizing all that stuff -
>> it's your application's performance that matters!" Alas, we're a generic
>> "any kind of research computing" organization, so there are thousands
>> of apps across all possible domains.
>>
>> Another interesting topic is that nodes are becoming many-core - any
>> thoughts?
>>
>> Alternatively, are there other places to ask? Reddit or something less
>> "greybeard"?
>>
>> thanks, mark hahn
>> McMaster U / SharcNET / ComputeOntario / DRI Alliance Canada
>>
>> PS: the snarky name "NVidiband" just occurred to me; too soon?




-- 
Doug


_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf




