[Beowulf] anyone have modern interconnect metrics?

Mark Hahn hahn at mcmaster.ca
Tue Jan 16 22:19:41 UTC 2024


Hi all,
Just wondering if any of you have numbers (or experience) with
modern high-speed COTS Ethernet.

Mainly latency, but perhaps also message rate.  Also ease of use
with open-source software like Open MPI, maybe Lustre?
And flexibility in configuring clusters in the >= 1k node range?

We have a good idea of what to expect from InfiniBand offerings,
and are familiar with scalable network topologies.
But vendors seem to think that high-end Ethernet (100-400 Gb/s) is competitive...

For instance, here's an excellent study of Cray/HPE Slingshot (non-COTS):
https://arxiv.org/pdf/2008.08886.pdf
(half RTT around 2 us, and the paper also has great stuff about congestion, etc.)

Yes, someone is sure to say "don't try characterizing all that stuff -
it's your application's performance that matters!"  Alas, we're a generic
"any kind of research computing" organization, so there are thousands of apps
across all possible domains.

Another interesting topic: nodes are becoming many-core, so more cores
share each network interface - any thoughts?

Alternatively, are there other places to ask? Reddit or something less "greybeard"?

thanks, mark hahn
McMaster U / SharcNET / ComputeOntario / DRI Alliance Canada

PS: the snarky name "NVidiband" just occurred to me; too soon?

