[Beowulf] How Would You Test Infiniband in New Cluster?

Bill Broadley bill at cse.ucdavis.edu
Tue Nov 17 10:33:17 PST 2009

Jon Forrest wrote:
> Let's say you have a brand new cluster with
> brand new Infiniband hardware, and that
> you've installed OFED 1.4 and the
> appropriate drivers for your IB
> HCAs (i.e. you see ib0 devices
> on the frontend and all compute nodes).
> The cluster appears to be working
> fine but you're not sure about IB.
> How would you test your IB network
> to make sure all is well?

My first suggested sanity test would be to measure latency and bandwidth to
ensure you are getting IB numbers.  80-100MB/sec and 30-60us for a small
packet would imply GigE; 6-8 times that bandwidth would certainly imply SDR
or better.  Latency varies quite a bit among implementations, so I'd try to
get within 30-40% of the advertised latency numbers.
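
As a rough sketch of that check, the following takes a measured bandwidth and
small-packet latency and says which side of the line they fall on.  The
thresholds just restate the numbers above; the function names, the choice of
6 x 80MB/sec as the SDR cutoff, and the 40% default tolerance are assumptions
made for illustration, not part of any IB tool.

```python
# Rule-of-thumb ranges from the text above (MB/sec and microseconds).
GIGE_BW_MB_S = (80.0, 100.0)
GIGE_LAT_US = (30.0, 60.0)
# Assumption: use the low end of "6-8 times" applied to the low GigE figure.
SDR_MIN_FACTOR = 6.0


def classify_link(bw_mb_s, lat_us):
    """Return a coarse label for a measured point-to-point link."""
    # GigE-class bandwidth together with GigE-class latency.
    if bw_mb_s <= GIGE_BW_MB_S[1] and lat_us >= GIGE_LAT_US[0]:
        return "gige-like"
    # At least 6x the GigE bandwidth suggests SDR or better.
    if bw_mb_s >= SDR_MIN_FACTOR * GIGE_BW_MB_S[0]:
        return "sdr-or-better"
    # Anything in between deserves a closer look.
    return "unclear"


def latency_ok(measured_us, advertised_us, tolerance=0.40):
    """True if measured latency is within `tolerance` of the advertised value."""
    return measured_us <= advertised_us * (1.0 + tolerance)
```

The measurements themselves would come from whatever point-to-point
benchmark you already trust; this only interprets the two numbers.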

Then I'd try a workload that keeps all nodes busy with something communications
intensive.  Pathscale has mpi_nxnlatbw, which works reasonably well to
identify ports/nodes that are slower than expected.
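
One way to turn all-pairs results into a list of suspect nodes is sketched
below.  It does not parse mpi_nxnlatbw output; it assumes you have already
reduced the results to hypothetical (src, dst, bandwidth) tuples, and the 80%
cutoff is an arbitrary starting point rather than a recommended value.

```python
from collections import defaultdict
from statistics import median


def find_slow_nodes(samples, min_fraction=0.8):
    """Flag nodes whose median pairwise bandwidth lags the cluster median.

    samples: iterable of (src, dst, bw_mb_s) tuples (assumed format).
    """
    per_node = defaultdict(list)
    for src, dst, bw in samples:
        # Each measurement counts against both endpoints.
        per_node[src].append(bw)
        per_node[dst].append(bw)

    node_median = {node: median(bws) for node, bws in per_node.items()}
    cluster_median = median(node_median.values())

    return sorted(
        node for node, m in node_median.items()
        if m < min_fraction * cluster_median
    )
```

Using medians keeps one bad node from dragging down every peer it talks to:
a healthy node still has mostly good pairs, while a node with a bad port or
cable is slow against everyone.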

After that works I'd suggest a production MPI work load with a known answer.
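
The "known answer" comparison can be as simple as the sketch below, which
checks a run's numeric results against stored reference values.  The function
name and the tolerance are assumptions; the right tolerance depends on the
application and on how much floating-point reordering its MPI reductions allow.

```python
import math


def matches_reference(results, reference, rel_tol=1e-9):
    """True if every result agrees with its reference value within rel_tol."""
    if len(results) != len(reference):
        return False
    return all(
        math.isclose(got, want, rel_tol=rel_tol)
        for got, want in zip(results, reference)
    )
```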

