[Beowulf] How Would You Test Infiniband in New Cluster?

Tue Nov 17 14:46:43 PST 2009

Jon Forrest wrote:
> Bill Broadley wrote:
> 
>> My first suggested sanity test would be to test latency and bandwidth to
>> ensure you are getting IB numbers.  So 80-100MB/sec and 30-60us for a small
>> packet would imply GigE.  6-8 times the bandwidth certainly would imply SDR
>> or better.  Latency varies quite a bit among implementations, I'd try to get
>> within 30-40% of advertised latency numbers.
> 
> For those of us who aren't familiar with IB utilities,
> could you give some examples of the commands you'd use
> to do this?
> 
> Thanks,
> Jon

Here are two that I use:
 http://cse.ucdavis.edu/bill/relay.c
 http://cse.ucdavis.edu/bill/mpi_nxnlatbw.c

So to compile, assuming a sane environment:
mpicc -O3 relay.c -o relay
mpicc -O3 mpi_nxnlatbw.c -o mpi_nxnlatbw

The command to run an MPI program varies by MPI implementation and batch
queue environment (especially with tight integration).
It should be something close to:
mpirun -np <number of nodes> -machinefile <list of nodes> ./relay 1
mpirun -np <number of nodes> -machinefile <list of nodes> ./relay 1024
mpirun -np <number of nodes> -machinefile <list of nodes> ./relay 8192

You should see something like:
c0-8 c0-22
size=     1,  16384 hops,  2 nodes in   0.75 sec ( 45.97 us/hop)     85 KB/sec
c0-8 c0-22
size=  1024,  16384 hops,  2 nodes in   2.00 sec (121.94 us/hop)  32803 KB/sec
c0-8 c0-22
size=  8192,  16384 hops,  2 nodes in   6.21 sec (379.05 us/hop)  84421 KB/sec

So basically on a tiny packet 45us of latency (normal for GigE), and on a
large packet 84MB/sec or so (also normal for GigE).
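
If you'd rather not eyeball the numbers, here is a rough Python sketch that pulls the us/hop and KB/sec figures out of a relay result line and makes a guess at the interconnect. The line layout is inferred from the sample output above (not from relay.c), the function names are made up for illustration, and the thresholds are just the rule-of-thumb numbers from earlier in this thread, so treat the answer as a hint only.

```python
import re

# Layout inferred from the sample relay output above, not from relay.c itself.
RELAY_RE = re.compile(
    r"size=\s*(\d+),\s*(\d+) hops,\s*(\d+) nodes in\s*([\d.]+) sec "
    r"\(\s*([\d.]+) us/hop\)\s*(\d+) KB/sec"
)

def parse_relay_line(line):
    """Return (size, us_per_hop, kb_per_sec), or None if the line does not match."""
    m = RELAY_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), float(m.group(5)), int(m.group(6))

def guess_interconnect(small_us_per_hop, large_kb_per_sec):
    """Rough guess based on the rule-of-thumb numbers quoted in this thread."""
    mb_per_sec = large_kb_per_sec / 1000.0
    # 30-60us small-packet latency and 80-100MB/sec suggest GigE.
    if small_us_per_hop >= 30.0 and mb_per_sec <= 100.0:
        return "looks like GigE"
    # 6x or more of GigE bandwidth suggests SDR IB or better.
    if mb_per_sec >= 6 * 80.0:
        return "looks like SDR IB or better"
    return "unclear, investigate"

small = parse_relay_line(
    "size=     1,  16384 hops,  2 nodes in   0.75 sec ( 45.97 us/hop)     85 KB/sec")
large = parse_relay_line(
    "size=  8192,  16384 hops,  2 nodes in   6.21 sec (379.05 us/hop)  84421 KB/sec")
print(guess_interconnect(small[1], large[2]))
```

Feed it the size=1 line for latency and the size=8192 line for bandwidth; anything that comes back "unclear" deserves a closer look by hand.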

I'd start with 2 nodes, then if you are happy try it with all nodes.

Now for infiniband you should see something like:

c0-5 c0-4
size=     1,  16384 hops,  2 nodes in   0.03 sec (  1.72 us/hop)   2274 KB/sec
c0-5 c0-4
size=  1024,  16384 hops,  2 nodes in   0.16 sec (  9.92 us/hop) 403324 KB/sec
c0-5 c0-4
size=  8192,  16384 hops,  2 nodes in  0.50 sec ( 30.34 us/hop) 1054606 KB/sec

Note that the latency is some 25 times lower and the bandwidth 10+ times
higher.  Also note that the two hostnames on each line are different: don't
run multiple copies on the same node unless you intend to.  Running 4 copies
on a 4 cpu node doesn't test infiniband.
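
A small Python sketch of both checks, for anyone scripting this. It assumes relay prints the participating hostnames on one whitespace-separated line before each result, which is what the sample output suggests; the helper name is hypothetical. The ratios use the figures from the two runs above.

```python
def hosts_are_distinct(host_line):
    """True if the line names at least two hosts and none is repeated."""
    hosts = host_line.split()
    return len(hosts) > 1 and len(set(hosts)) == len(hosts)

# Figures copied from the GigE and infiniband runs above.
gige_us, ib_us = 45.97, 1.72        # us/hop at size=1
gige_kb, ib_kb = 84421, 1054606     # KB/sec at size=8192

latency_ratio = gige_us / ib_us     # how many times lower the IB latency is
bandwidth_ratio = ib_kb / gige_kb   # how many times higher the IB bandwidth is

print(hosts_are_distinct("c0-5 c0-4"))   # two different nodes
print(hosts_are_distinct("c0-5 c0-5"))   # both ranks landed on one node
print(round(latency_ratio, 1), round(bandwidth_ratio, 1))
```

If the hostname check fails, the latency and bandwidth numbers are measuring shared memory rather than the fabric.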

So once you get what you expect I'd suggest something a bit more
comprehensive.  Something like:
mpirun -np <number of nodes> -machinefile <list of nodes> ./mpi_nxnlatbw

I'd expect some difference in latency and bandwidth between nodes, but not any
big differences.  Something like:
[0<->1]		1.85us		1398.825264 (MillionBytes/sec)
[0<->2]		1.75us		1300.812337 (MillionBytes/sec)
[0<->3]		1.76us		1396.205242 (MillionBytes/sec)
[0<->4]		1.68us		1398.647324 (MillionBytes/sec)
[1<->0]		1.82us		1375.550155 (MillionBytes/sec)
[1<->2]		1.69us		1397.936020 (MillionBytes/sec)
...
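
On a large cluster the pair list gets long, so here is a hedged Python sketch that flags pairs that stand out from the rest. The line format is taken from the sample output above, the function names are made up, and the 10% tolerance around the median is an arbitrary assumption; pick whatever spread you consider acceptable for your fabric.

```python
import re
from statistics import median

# Matches lines like "[0<->1]  1.85us  1398.825264 (MillionBytes/sec)".
PAIR_RE = re.compile(r"\[(\d+)<->(\d+)\]\s+([\d.]+)us\s+([\d.]+)")

def parse_pairs(text):
    """Return a list of (src, dst, latency_us, million_bytes_per_sec)."""
    pairs = []
    for line in text.splitlines():
        m = PAIR_RE.search(line)
        if m:
            pairs.append((int(m.group(1)), int(m.group(2)),
                          float(m.group(3)), float(m.group(4))))
    return pairs

def find_outliers(pairs, tolerance=0.10):
    """Pairs whose latency is well above, or bandwidth well below, the median."""
    if not pairs:
        return []
    med_lat = median(p[2] for p in pairs)
    med_bw = median(p[3] for p in pairs)
    return [p for p in pairs
            if p[2] > med_lat * (1 + tolerance)
            or p[3] < med_bw * (1 - tolerance)]

SAMPLE = """\
[0<->1]\t\t1.85us\t\t1398.825264 (MillionBytes/sec)
[0<->2]\t\t1.75us\t\t1300.812337 (MillionBytes/sec)
[0<->3]\t\t1.76us\t\t1396.205242 (MillionBytes/sec)
[0<->4]\t\t1.68us\t\t1398.647324 (MillionBytes/sec)
[1<->0]\t\t1.82us\t\t1375.550155 (MillionBytes/sec)
[1<->2]\t\t1.69us\t\t1397.936020 (MillionBytes/sec)
"""

for src, dst, lat, bw in find_outliers(parse_pairs(SAMPLE)):
    print("check link %d<->%d: %.2fus, %.1f MB/s" % (src, dst, lat, bw))
```

With the sample numbers above nothing is flagged at 10%. A pair that repeatedly shows up in the outlier list usually points at one specific node, cable, or switch port.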

Once those numbers are consistent and where you expect them (both latency and
bandwidth) I'd follow up with a production code that produces a known answer
and is likely to provide much wider MPI coverage.