[Beowulf] [External] anyone have modern interconnect metrics?
Prentice Bisbal
pbisbal at pppl.gov
Sat Jan 20 02:39:58 UTC 2024
> Also ease of use with open-source products like OpenMPI,
I don't see this being an issue. OpenMPI will detect your different
interconnects and start with the fastest, working its way down to the
slowest. OpenMPI has always "just worked" for me, regardless of the
network. The only issue is that if it doesn't find IB, it will warn
that it's falling back to Ethernet, which is probably not what you
want, but that's easy to turn off in the central config file.
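For example, on a recent OpenMPI 4.x install, a couple of lines in
$PREFIX/etc/openmpi-mca-params.conf along these lines should do it
(just a sketch; parameter names vary between versions, so check
ompi_info on your build):

    # silence the warning about unused high-performance BTL components
    # (the usual knob for the Ethernet-fallback warning; verify with ompi_info)
    btl_base_warn_component_unused = 0
    # optionally pin the transport explicitly, e.g. UCX over IB
    pml = ucx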
> But vendors seem to think that high-end ethernet (100-400Gb) is
> competitive...
Call me a cynical old codger, but I would not be surprised if that's
more profitable for them, or they have other incentives to promote
Ethernet instead of IB. Or, if you prefer Hanlon's razor, maybe they
just don't know squat about IB, so they're selling you what they do know.
>
> Yes, someone is sure to say "don't try characterizing all that stuff -
> it's your application's performance that matters!" Alas, we're a generic
> "any kind of research computing" organization, so there are thousands
> of apps
> across all possible domains.
<rant>
I agree with you. I've always hated the "it depends on your application"
stock response in HPC. I think it's BS. Very few of us work in an
environment where we support only a handful of applications with very
similar characteristics. I say use standardized benchmarks that test
specific performance metrics (memory bandwidth, memory latency, etc.)
first, and then use a few applications to confirm what you're seeing
with those benchmarks (example below).
</rant>
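To make that concrete, a typical first pass on a test system might look
something like this (a sketch, assuming the OSU micro-benchmarks and an
OpenMP build of STREAM are already installed; binary names and paths
depend on your setup):

    # point-to-point MPI latency and bandwidth between two nodes
    mpirun -np 2 --map-by node osu_latency
    mpirun -np 2 --map-by node osu_bw

    # per-node memory bandwidth with all cores busy
    OMP_NUM_THREADS=$(nproc) ./stream

Then run two or three representative applications and check that their
relative rankings agree with what the micro-benchmarks told you.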
> Another interesting topic is that nodes are becoming many-core - any
> thoughts?
Core counts are getting too high to be of use in HPC. High core-count
processors sound great until you realize that all those cores are now
competing for the same memory bandwidth and network bandwidth, neither
of which increases with core count.
Last April we were evaluating test systems from different vendors for a
cluster purchase. One of our test users does a lot of CFD simulations
that are very sensitive to memory bandwidth. While he was getting a 50%
speedup on AMD compared to Intel (which makes sense, since the AMDs
require 12 DIMM slots to be populated instead of Intel's 8), he asked
us to consider servers with FEWER cores. Even on the AMDs, he was
saturating the memory bandwidth before scaling to all the cores,
causing his performance to plateau. Buying cheaper processors with
lower core counts was better for him, since the savings would allow us
to buy additional nodes, which would benefit him more.
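That plateau is easy to see for yourself: sweep the thread count with
STREAM and watch the Triad number stop scaling long before you reach
all the cores (a sketch; adjust the thread counts and pinning to match
your sockets and NUMA layout):

    # STREAM Triad bandwidth vs. thread count (assumes ./stream is an OpenMP build)
    for t in 1 2 4 8 16 32 64 96; do
        echo -n "$t threads: "
        OMP_NUM_THREADS=$t ./stream | grep Triad
    done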
>
> Alternatively, are there other places to ask? Reddit or something less
> "greybeard"?
I've been very disappointed with the "expertise" on the HPC-related
subreddits. Last time I lurked there, it seemed very amateurish/DIY
oriented. For example, someone wanted to buy all the individual
components and assemble their own nodes for an entire cluster at their
job. Can you imagine? Most of the replies were encouraging them to do
so....
You might want to join the HPCSYSPROS Slack channel and ask there.
HPCSYSPROS is an ACM SIG for HPC system admins that runs workshops
every year at SC. Click on the "Get Involved" link on this page:
https://sighpc-syspros.org/
--
Prentice
On 1/16/24 5:19 PM, Mark Hahn wrote:
> Hi all,
> Just wondering if any of you have numbers (or experience) with
> modern high-speed COTS ethernet.
>
> Latency mainly, but perhaps also message rate. Also ease of use
> with open-source products like OpenMPI, maybe Lustre?
> Flexibility in configuring clusters in the >= 1k node range?
>
> We have a good idea of what to expect from Infiniband offerings,
> and are familiar with scalable network topologies.
> But vendors seem to think that high-end ethernet (100-400Gb) is
> competitive...
>
> For instance, here's an excellent study of Cray/HP Slingshot (non-COTS):
> https://arxiv.org/pdf/2008.08886.pdf
> (half rtt around 2 us, but this paper has great stuff about
> congestion, etc)
>
> Yes, someone is sure to say "don't try characterizing all that stuff -
> it's your application's performance that matters!" Alas, we're a generic
> "any kind of research computing" organization, so there are thousands
> of apps
> across all possible domains.
>
> Another interesting topic is that nodes are becoming many-core - any
> thoughts?
>
> Alternatively, are there other places to ask? Reddit or something less
> "greybeard"?
>
> thanks, mark hahn
> McMaster U / SharcNET / ComputeOntario / DRI Alliance Canada
>
> PS: the snarky name "NVidiband" just occurred to me; too soon?
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf