<div dir="ltr">Morning<div><br></div><div>I strongly suggest you get Mellanox to come in and help with the initial config.  Their technical teams are great and they know what they are doing.</div><div><br></div><div>We run OpenMPI with UCX on top of a Mellanox Multi-Host Ethernet network.</div><div><br></div><div>The setup required a few parameters on each switch and away we went.</div><div><br></div><div>At least with RoCE you can have ALL your equipment on one network (storage, desktops, cluster, etc.).  You don't need dual setups for storage (InfiniBand access for the cluster and Ethernet access for other devices).</div><div><br></div><div>We run a &gt;10k-node cluster with a full L3 RoCE network and it performs wonderfully and reliably.</div><div><br></div><div><br></div>On the switches we do<br><br><blockquote style="margin:0 0 0 40px;border:none;padding:0px"><font face="monospace">roce lossy<br>interface ethernet * qos trust both<br>interface ethernet * traffic-class 3 congestion-control ecn minimum-relative 75 maximum-relative 95</font></blockquote><br><br>and on the hosts (ConnectX-5) we do<br><br><font face="monospace">/bin/mlnx_qos -i "$eth" --trust dscp<br>/bin/mlnx_qos -i "$eth" --prio2buffer=0,0,0,0,0,0,0,0<br>/bin/mlnx_qos -i "$eth" --pfc 0,0,0,0,0,0,0,0<br>/bin/mlnx_qos -i "$eth" --buffer_size=524160,0,0,0,0,0,0,0<br># mlxreg requires the existence of /etc/mft/mft.conf (even though we don't need it)<br>mft_config="/etc/mft/mft.conf"<br>mkdir -p "${mft_config%/*}" &amp;&amp; touch "$mft_config"<br># the device's uevent file defines PCI_SLOT_NAME for this interface<br>source "/sys/class/net/${eth}/device/uevent"<br>mlxreg -d "$PCI_SLOT_NAME" -reg_name ROCE_ACCL --set "roce_adp_retrans_en=0x1,roce_tx_window_en=0x1,roce_slow_restart_en=0x0" --yes<br># enable TCP ECN<br>echo 1 &gt; /proc/sys/net/ipv4/tcp_ecn</font><div><br></div><div><br></div><div>Stu.</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr"
class="gmail_attr">On Thu, Sep 30, 2021 at 5:26 AM Lohit Valleru &lt;<a href="mailto:lohitv@gwmail.gwu.edu">lohitv@gwmail.gwu.edu</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr">Hello everyone,<div><br>I am now facing a similar question about what to choose for a new cluster: RoCE vs InfiniBand.</div><div>My experience with RoCE when I tried it recently was that it was not easy to set up. It required me to set up QoS for a lossless fabric, and PFC for flow control. On top of that, it required me to decide how much buffer to dedicate to each class.</div><div>I am not a networking expert, and it has been very difficult for me to understand how much buffer would be enough, how to monitor buffers, and how to tell when I would need to dedicate more buffers.<br></div><div><div>Following a few Mellanox documents, I think I did enable RoCE; however, I am not sure I set it up the right way, because I was not able to make it work reliably with GPFS or any MPI applications.</div></div><div><br></div><div>In the past, when I worked with InfiniBand, it was a breeze to set up and make work with GPFS and other MPI applications. I did have issues with InfiniBand errors, which were not easy to debug, but other than that, management and setup of InfiniBand seemed to be very easy. </div><div><br></div><div>Currently, we have a lot of Ethernet networking issues, where we see many packet discards and retransmits, leading to GPFS complaining about them and remounting or expelling the filesystem. In addition, I see IO performance issues.</div><div>I was told the reason might be a low-buffer 100Gb switch, and that a deep-buffer 100Gb switch might solve the issue.</div><div>However, I could not prove that buffers were the cause. 
</div><div>We enabled ECN and it seemed to make the network a bit more stable, but the performance is still lacking.</div><div>Most of the issues were caused by IO between storage and clients, where multiple storage servers are able to put out a lot more network bandwidth than a single client can take.</div><div><br></div><div>So I was thinking that a better solution, with the fewest setup and debugging issues, would be to have both Ethernet and InfiniBand: Ethernet for administrative traffic and InfiniBand for data traffic.</div><div>However, the counter-argument is: why not use RoCE instead of InfiniBand?</div><div><br></div><div>When it comes to RoCE, I find it very difficult to find documentation on how to set things up the correct way and debug issues with buffering/flow control.</div><div>I do see that Mellanox has some documentation on RoCE, but it has not been very clear. </div><div><br></div><div>My understanding is that RoCE would mostly be beneficial over long distances, or when we might need to route between Ethernet and InfiniBand.</div><div><br></div><div><div>Could anyone share, from their experience, what they would choose if they were building a new cluster, and why?</div><div>Would RoCE or InfiniBand be easier to set up, manage and debug?</div><div><br></div></div><div>As of now, the new cluster is going to be within a single data center, and it might grow to about 500 nodes with 4 GPUs and 64 cores each.</div><div>We might get storage systems (with multiple storage servers containing 2 to 6 ConnectX-6 cards per server) that can do 420GB/s or more, and clients with either a single 100G ConnectX-6 or 8 ConnectX-6 cards.</div><div><br></div><div>For RoCE, could anyone point me to documentation that would help me learn how to set it up and debug it correctly?</div><div><br></div><div>Thank you,<br>Lohit</div></div></div><br><div class="gmail_quote"><div dir="ltr"
class="gmail_attr">On Sun, Jan 17, 2021 at 8:07 PM Stu Midgley &lt;<a href="mailto:sdm900@gmail.com" target="_blank">sdm900@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><div dir="ltr">Morning (Hi Gilad)<div><br></div><div>We run RoCE over Mellanox 100G Ethernet and get 1.3us latency for the shortest hop, increasing slightly as you go through the fabric.</div><div><br></div><div>We run Ethernet for a full dual-plane fat-tree :)  It is 100% possible with Mellanox :)</div><div><br></div><div>We love it.</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jan 15, 2021 at 8:40 PM Jörg Saßmannshausen &lt;<a href="mailto:sassy-work@sassy.formativ.net" target="_blank">sassy-work@sassy.formativ.net</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">Hi Gilad,<br>

<br>

thanks for the feedback, much appreciated. <br>

In an ideal world, you are right of course. OpenStack is supported natively on <br>

InfiniBand, and you can get the MetroX system to connect between two different <br>

sites (I leave it open how to read that) etc. <br>

<br>

However, in the real world all of that needs to fit into a budget. From what I <br>

can see on the cluster, most jobs are in the region between 64 and 128 cores. <br>

So, that raises the question: for that rather small number of cores, do we <br>

really need InfiniBand or can we do what we need to do with RoCE v2?<br>

<br>

In other words, for the same budget, does it make sense to remove the <br>

InfiniBand part of the design and get, say, one GPU box in instead?<br>

<br>

What I want to avoid is making the wrong decision (cheap and cheerful) and <br>

ending up with a badly designed cluster later. <br>

<br>

As you mentioned MetroX: remind me please, what kind of cable does it need? Is <br>

that something special or can we use already existing cables, whatever is used <br>

between data centre sites (sic!)?<br>

<br>

We had a chat with Darren about that which was, as always when talking to your <br>

colleague Darren, very helpful. I remember very distinctly that there was a reason <br>

why we went for the InfiniBand/RoCE solution but I cannot really remember it. <br>

It was something with the GPU boxes we want to buy as well. <br>

<br>

I will pass your comments on to my colleague next week when I am back at work <br>

and see what they say. So many thanks for your sentiments here which are much <br>

appreciated from me!<br>

<br>

All the best from a cold London<br>

<br>

Jörg<br>

<br>

On Thursday, 26 November 2020, 12:51:55 GMT, Gilad Shainer wrote:<br>

> Let me try to help:<br>

> <br>

> -          OpenStack is supported natively on InfiniBand already, therefore<br>

> there is no need to go to Ethernet for that<br>

<br>

> -          File-system-wise, you can have an IB file system and connect<br>

> directly to the IB system.<br>

<br>

> -          Depending on the distance, you can run 2 km IB links between switches, or<br>

> use Mellanox MetroX for connecting over 40 km. VicinityIO have systems that<br>

> go over thousands of miles…<br>

<br>

> -          IB's advantages are much lower latency (switches alone have 3X<br>

> lower latency), cost effectiveness (for the same speed, IB switches are<br>

> more cost effective than Ethernet) and the In-Network Computing engines<br>

> (MPI reduction operations and Tag Matching run on the network)<br>

<br>

> If you need help, feel free to contact me directly.<br>

> <br>

> Regards,<br>

> Gilad Shainer<br>

> <br>

> From: Beowulf [mailto:<a href="mailto:beowulf-bounces@beowulf.org" target="_blank">beowulf-bounces@beowulf.org</a>] On Behalf Of John Hearns<br>

> Sent: Thursday, November 26, 2020 3:42 AM<br>

> To: Jörg Saßmannshausen <<a href="mailto:sassy-work@sassy.formativ.net" target="_blank">sassy-work@sassy.formativ.net</a>>; Beowulf Mailing<br>

> List <<a href="mailto:beowulf@beowulf.org" target="_blank">beowulf@beowulf.org</a>><br>

> Subject: Re: [Beowulf] RoCE vs. InfiniBand<br>

> <br>


> Jorg, I think I might know where the Lustre storage is !<br>

> It is possible to install storage routers, so you could route between<br>

> ethernet and infiniband.<br>

> It is also worth saying that Mellanox have Metro<br>

> Infiniband switches - though I do not think they go as far as the west of<br>

> London! <br>

> Seriously though, you ask about RoCE. I will stick my neck out and say yes,<br>

> if you are planning an OpenStack cluster with the intention of having<br>

> mixed AI and 'traditional' HPC workloads I would go for a RoCE style setup.<br>

> In fact, I am in a discussion about a new project for a customer with<br>

> similar aims in an hour's time. <br>

> I could get some benchmarking time if you want to do a direct comparison of<br>

> Gromacs on IB / RoCE<br>

<br>

> <br>


> On Thu, 26 Nov 2020 at 11:14, Jörg Saßmannshausen<br>

> &lt;<a href="mailto:sassy-work@sassy.formativ.net" target="_blank">sassy-work@sassy.formativ.net</a>&gt;<br>

> wrote:<br>

> Dear all,<br>

> <br>

> as the DNS problems have been solved (many thanks for doing this!), I was<br>

> wondering if people on the list have some experiences with this question:<br>

> <br>

> We are currently in the process of purchasing a new cluster and we want to<br>

> use OpenStack for the whole management of the cluster. Part of the cluster<br>

> will run HPC applications like GROMACS for example, other parts typical<br>

> OpenStack applications like VMs. We are also implementing a Data Safe Haven<br>

> for the more sensitive data we are aiming to process. Of course, we want to<br>

> have a decent size GPU partition as well!<br>

> <br>

> Now, traditionally I would say that we are going for InfiniBand. However,<br>

> for reasons I don't want to go into right now, our existing file storage<br>

> (Lustre) will be in a different location. Thus, we decided to go for RoCE<br>

> for the file storage and InfiniBand for the HPC applications.<br>

> <br>

> The point I am struggling with is understanding whether this is really the best<br>

> solution or whether, given that we are not building a 100k node cluster, we<br>

> could use RoCE for the few nodes which are doing parallel, read MPI, jobs<br>

> too. I have a nagging feeling that I am missing something if we are moving<br>

> to pure RoCE and ditching the InfiniBand. We have a mixed workload, from ML/AI<br>

> to MPI applications like GROMACS to pipelines like those used in<br>

> bioinformatics. We are not planning to partition the GPUs, the<br>

> current design model is to have only 2 GPUs in a chassis.<br>

> So, is there something I am missing or is the gut feeling I have really<br>

> a lust for some sushi? :-)<br>

> <br>

> Thanks for your sentiments here, very welcome!<br>

> <br>

> All the best from a dull London<br>

> <br>

> Jörg<br>

> <br>

> <br>

> <br>

> _______________________________________________<br>

> Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org" target="_blank">Beowulf@beowulf.org</a><br>

> sponsored by Penguin Computing<br>

> To change your subscription (digest mode or unsubscribe) visit<br>

> <a href="https://beowulf.org/cgi-bin/mailman/listinfo/beowulf" rel="noreferrer" target="_blank">https://beowulf.org/cgi-bin/mailman/listinfo/beowulf</a><br>

<br>

<br>

<br>


</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr"><div dir="ltr">Dr Stuart Midgley<br><a href="mailto:sdm900@gmail.com" target="_blank">sdm900@gmail.com</a></div></div>


</blockquote></div>

</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr">Dr Stuart Midgley<br><a href="mailto:sdm900@gmail.com" target="_blank">sdm900@gmail.com</a></div></div>