[Beowulf] RoCE vs. InfiniBand

Stu Midgley sdm900 at gmail.com
Mon Jan 18 01:06:32 UTC 2021


Morning (Hi Gilad)

We run RoCE over Mellanox 100G Ethernet and get 1.3 µs latency for the
shortest hop, increasing slightly as you go deeper into the fabric.
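
For anyone who wants to sanity-check a number like that on their own fabric,
a minimal mpi4py ping-pong along these lines will do (one rank per node;
perftest's ib_send_lat or the OSU micro-benchmarks are the more usual tools):

# Tiny ping-pong between two ranks on different nodes; reports the
# half round-trip time, i.e. the one-way small-message latency.
from mpi4py import MPI
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
buf = bytearray(8)          # small, latency-bound message
iters = 10000

comm.Barrier()
t0 = time.perf_counter()
for _ in range(iters):
    if rank == 0:
        comm.Send(buf, dest=1)
        comm.Recv(buf, source=1)
    else:
        comm.Recv(buf, source=0)
        comm.Send(buf, dest=0)
t1 = time.perf_counter()

if rank == 0:
    print(f"one-way latency: {(t1 - t0) / iters / 2 * 1e6:.2f} us")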

We run Ethernet for a full dual-plane fat-tree :)  It is 100% possible with
Mellanox :)
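
As a rough illustration of what a two-level fat-tree built from fixed-radix
switches buys you (back-of-envelope only; the radix of 40 below is an
assumption, and a dual-plane design simply duplicates the whole fabric):

# Node count for a non-blocking two-level fat-tree of radix-k switches:
# each leaf uses k/2 ports down to hosts and k/2 up to the spines.
def two_level_fat_tree(k: int) -> dict:
    leaves = k                  # one spine port per leaf limits us to k leaves
    spines = k // 2
    hosts = leaves * (k // 2)   # hosts per leaf times number of leaves
    return {"leaf_switches": leaves, "spine_switches": spines, "max_hosts": hosts}

print(two_level_fat_tree(40))   # {'leaf_switches': 40, 'spine_switches': 20, 'max_hosts': 800}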

We love it.


On Fri, Jan 15, 2021 at 8:40 PM Jörg Saßmannshausen <sassy-work at sassy.formativ.net> wrote:

> Hi Gilad,
>
> thanks for the feedback, much appreciated.
> In an ideal world, you are right of course: OpenStack is supported natively
> on InfiniBand, and you can get the MetroX system to connect between two
> different sites (I leave it open how to read that), etc.
>
> However, in the real world all of that needs to fit into a budget. From
> what I can see on the cluster, most jobs are in the region of 64 to 128
> cores. So that raises the question: for that rather small number of cores,
> do we really need InfiniBand, or can we do what we need to do with RoCE v2?
>
> In other words, for the same budget, does it make sense to remove the
> InfiniBand part of the design and get, say, one GPU box in instead?
>
> What I want to avoid is making the wrong decision (cheap and cheerful) and
> ending up with a badly designed cluster later.
>
> Since you mentioned MetroX: remind me please, what kind of cable does it
> need? Is that something special, or can we use existing cables, whatever is
> used between data centre sites (sic!)?
>
> We had a chat with Darren about that, which was, as talking to your
> colleague Darren always is, very helpful. I remember very distinctly that
> there was a reason why we went for the InfiniBand/RoCE solution, but I
> cannot really remember it. It was something to do with the GPU boxes we
> want to buy as well.
>
> I will pass your comments on to my colleague next week when I am back at
> work and see what they say. So many thanks for your thoughts here, which
> are much appreciated!
>
> All the best from a cold London
>
> Jörg
>
> On Thursday, 26 November 2020 at 12:51:55 GMT, Gilad Shainer wrote:
> > Let me try to help:
> >
> > - OpenStack is supported natively on InfiniBand already, therefore there
> >   is no need to go to Ethernet for that
>
> > - File-system-wise, you can have an IB file system and connect directly
> >   to the IB system.
>
> > - Depending on the distance, you can run 2 km of IB between switches, or
> >   use Mellanox MetroX for connecting over 40 km. VicinityIO have systems
> >   that go over thousands of miles…
>
> > - IB advantages are much lower latency (the switches alone have 3x lower
> >   latency), cost effectiveness (for the same speed, IB switches are more
> >   cost effective than Ethernet) and the In-Network Computing engines (MPI
> >   reduction operations and Tag Matching run on the network)
>
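
The in-network reduction point is easiest to see with a small allreduce
timing loop; an mpi4py sketch like the one below gives a number you can
compare across fabrics (illustrative only - whether SHARP-style offload is
actually engaged depends on the MPI library and fabric configuration):

# Time a small MPI_Allreduce, the collective that in-network computing
# (e.g. SHARP on InfiniBand) is designed to accelerate.
from mpi4py import MPI
import numpy as np
import time

comm = MPI.COMM_WORLD
data = np.ones(8, dtype=np.float64)   # small, latency-bound reduction
out = np.empty_like(data)
iters = 1000

comm.Barrier()
t0 = time.perf_counter()
for _ in range(iters):
    comm.Allreduce(data, out, op=MPI.SUM)
t1 = time.perf_counter()

if comm.Get_rank() == 0:
    print(f"mean allreduce time: {(t1 - t0) / iters * 1e6:.2f} us")
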
> > If you need help, feel free to contact directly.
> >
> > Regards,
> > Gilad Shainer
> >
> > From: Beowulf [mailto:beowulf-bounces at beowulf.org] On Behalf Of John Hearns
> > Sent: Thursday, November 26, 2020 3:42 AM
> > To: Jörg Saßmannshausen <sassy-work at sassy.formativ.net>; Beowulf Mailing List <beowulf at beowulf.org>
> > Subject: Re: [Beowulf] RoCE vs. InfiniBand
> >
> > Jörg, I think I might know where the Lustre storage is!
> > It is possible to install storage routers, so you could route between
> > Ethernet and InfiniBand. It is also worth saying that Mellanox have Metro
> > InfiniBand switches - though I do not think they go as far as the west of
> > London!
> > Seriously though, you ask about RoCE. I will stick my neck out and say
> > yes: if you are planning an OpenStack cluster with the intention of having
> > mixed AI and 'traditional' HPC workloads, I would go for a RoCE-style
> > setup. In fact I am in a discussion about a new project for a customer
> > with similar aims in an hour's time.
> > I could get some benchmarking time if you want to do a direct comparison
> > of GROMACS on IB / RoCE.
>
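
A harness for that kind of GROMACS comparison might look roughly like the
sketch below; the gmx_mpi binary, the topol.tpr benchmark input, the device
names and the use of UCX_NET_DEVICES to steer traffic onto one fabric or the
other are all assumptions about the local setup rather than a recipe:

# Run the same GROMACS benchmark once per fabric and keep separate logs;
# the ns/day figure is reported near the end of each md_*.log.
import os
import subprocess

def run_case(label: str, net_device: str, nodes: int = 4, ppn: int = 32) -> None:
    env = dict(os.environ, UCX_NET_DEVICES=net_device)
    cmd = [
        "mpirun", "-np", str(nodes * ppn),
        "gmx_mpi", "mdrun",
        "-s", "topol.tpr",      # benchmark system (placeholder name)
        "-nsteps", "20000",
        "-resethway",           # discard the warm-up half when timing
        "-noconfout",
        "-g", f"md_{label}.log",
    ]
    subprocess.run(cmd, env=env, check=True)

run_case("ib", "mlx5_0:1")      # InfiniBand port (placeholder device name)
run_case("roce", "mlx5_1:1")    # RoCE-capable port (placeholder device name)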
> >
> > On Thu, 26 Nov 2020 at 11:14, Jörg Saßmannshausen
> > <sassy-work at sassy.formativ.net> wrote:
> >
> > Dear all,
> >
> > as the DNS problems have been solved (many thanks for doing this!), I was
> > wondering if people on the list have some experience with this question:
> >
> > We are currently in the process of purchasing a new cluster, and we want
> > to use OpenStack for the whole management of the cluster. Part of the
> > cluster will run HPC applications like GROMACS, for example; other parts
> > will run typical OpenStack workloads like VMs. We are also implementing a
> > Data Safe Haven for the more sensitive data we are aiming to process. Of
> > course, we want to have a decent-sized GPU partition as well!
> >
> > Now, traditionally I would say that we are going for InfiniBand. However,
> > for reasons I don't want to go into right now, our existing file storage
> > (Lustre) will be in a different location. Thus, we decided to go for RoCE
> > for the file storage and InfiniBand for the HPC applications.
> >
> > The point I am struggling with is to understand whether this is really
> > the best solution or whether, given that we are not building a 100k-node
> > cluster, we could use RoCE for the few nodes which are doing parallel
> > (read: MPI) jobs too. I have a nagging feeling that I am missing something
> > if we move to pure RoCE and ditch the InfiniBand. We have a mixed
> > workload, from ML/AI to MPI applications like GROMACS to pipelines like
> > those used in the bioinformatics corner. We are not planning to partition
> > the GPUs; the current design model is to have only 2 GPUs in a chassis.
> > So, is there something I am missing, or is the stomach feeling I have
> > really just a lust for some sushi? :-)
> >
> > Thanks for your thoughts here, much appreciated!
> >
> > All the best from a dull London
> >
> > Jörg
> >
> >
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>


-- 
Dr Stuart Midgley
sdm900 at gmail.com