[Beowulf] RoCE vs. InfiniBand

Jörg Saßmannshausen sassy-work at sassy.formativ.net
Thu Nov 26 11:14:05 UTC 2020


Dear all,

as the DNS problems have been solved (many thanks for doing this!), I was
wondering if people on the list have some experience with this question:
 
We are currently in the process of purchasing a new cluster and we want to use
OpenStack for the whole management of the cluster. Part of the cluster will
run HPC applications such as GROMACS, other parts typical OpenStack
workloads such as VMs. We are also implementing a Data Safe Haven for the more
sensitive data we are aiming to process. Of course, we want to have a
decent-sized GPU partition as well!

Now, traditionally I would say that we are going for InfiniBand. However, for 
reasons I don't want to go into right now, our existing file storage (Lustre) 
will be in a different location. Thus, we decided to go for RoCE for the file 
storage and InfiniBand for the HPC applications. 

What I am struggling to understand is whether this is really the best
solution or whether, given that we are not building a 100k node cluster, we
could also use RoCE for the few nodes which are running parallel (read: MPI) jobs.
I have a nagging feeling that I am missing something if we move to pure
RoCE and ditch InfiniBand. We have a mixed workload, from ML/AI to MPI
applications like GROMACS to pipelines like those used in the bioinformatics
corner. We are not planning to partition the GPUs; the current design model is
to have only 2 GPUs in a chassis.
So, is there something I am missing or is the stomach feeling I have really a 
lust for some sushi? :-)

Thanks for your thoughts here, they are most welcome!

All the best from a dull London

Jörg




