[Beowulf] How Would You Test Infiniband in New Cluster?
Bill Broadley
bill at cse.ucdavis.edu
Tue Nov 17 14:46:43 PST 2009
Jon Forrest wrote:
> Bill Broadley wrote:
>
>> My first suggested sanity test would be to test latency and bandwidth to
>> ensure you are getting IB numbers. So 80-100MB/sec and 30-60us for a small
>> packet would imply GigE. 6-8 times the bandwidth would certainly imply SDR
>> or better. Latency varies quite a bit among implementations; I'd try to get
>> within 30-40% of advertised latency numbers.
>
> For those of us who aren't familiar with IB utilities,
> could you give some examples of the commands you'd use
> to do this?
>
> Thanks,
> Jon
Here are two that I use:
http://cse.ucdavis.edu/bill/relay.c
http://cse.ucdavis.edu/bill/mpi_nxnlatbw.c
So to compile, assuming a sane environment:
mpicc -O3 relay.c -o relay
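In case the link ever goes stale, here's a stripped-down sketch of the same
kind of ring relay test (just a sketch, not relay.c itself, so its output
won't line up exactly with the numbers below):

/*
 * ring_relay.c -- a simplified ring relay: pass a message of the requested
 * size around the ring of ranks and report the average time per hop.
 * Compile: mpicc -O3 ring_relay.c -o ring_relay
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, nprocs, next, prev, lap;
    int bytes = (argc > 1) ? atoi(argv[1]) : 1;  /* message size in bytes */
    int laps  = 8192;                            /* trips around the ring */
    char *buf;
    double start, elapsed, us_per_hop;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    buf = malloc(bytes);
    memset(buf, 0, bytes);
    next = (rank + 1) % nprocs;
    prev = (rank + nprocs - 1) % nprocs;

    MPI_Barrier(MPI_COMM_WORLD);
    start = MPI_Wtime();
    for (lap = 0; lap < laps; lap++) {
        if (rank == 0) {             /* rank 0 starts each trip */
            MPI_Send(buf, bytes, MPI_CHAR, next, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, prev, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {                     /* everyone else just forwards it */
            MPI_Recv(buf, bytes, MPI_CHAR, prev, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, next, 0, MPI_COMM_WORLD);
        }
    }
    elapsed = MPI_Wtime() - start;

    if (rank == 0) {
        int hops = laps * nprocs;    /* one hop per rank per trip */
        us_per_hop = elapsed / hops * 1e6;
        printf("size=%8d bytes, %d hops, %d ranks, %.2f sec, %.2f us/hop, %.0f KB/sec\n",
               bytes, hops, nprocs, elapsed, us_per_hop,
               bytes / (us_per_hop / 1e6) / 1024.0);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}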
The command to run an MPI program varies by MPI implementation and batch
queue environment (especially with tight integration). It should be
something close to:
mpirun -np <number of nodes> -machinefile <list of nodes> ./relay 1
mpirun -np <number of nodes> -machinefile <list of nodes> ./relay 1024
mpirun -np <number of nodes> -machinefile <list of nodes> ./relay 8192
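For example, with two nodes listed one per line in a file called hosts (pick
whatever filename you like), that might look like:

mpirun -np 2 -machinefile hosts ./relay 1
mpirun -np 2 -machinefile hosts ./relay 1024
mpirun -np 2 -machinefile hosts ./relay 8192

If your MPI or queue system launches jobs differently (mpiexec, srun, etc.),
use whatever it expects.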
You should see something like:
c0-8 c0-22
size= 1, 16384 hops, 2 nodes in 0.75 sec ( 45.97 us/hop) 85 KB/sec
c0-8 c0-22
size= 1024, 16384 hops, 2 nodes in 2.00 sec (121.94 us/hop) 32803 KB/sec
c0-8 c0-22
size= 8192, 16384 hops, 2 nodes in 6.21 sec (379.05 us/hop) 84421 KB/sec
So basically, on a tiny packet you see about 45us of latency (normal for
GigE), and on a large message about 84MB/sec of bandwidth (also normal for
GigE).
I'd start with 2 nodes, then if you are happy try it with all nodes.
Now for infiniband you should see something like:
c0-5 c0-4
size= 1, 16384 hops, 2 nodes in 0.03 sec ( 1.72 us/hop) 2274 KB/sec
c0-5 c0-4
size= 1024, 16384 hops, 2 nodes in 0.16 sec ( 9.92 us/hop) 403324 KB/sec
c0-5 c0-4
size= 8192, 16384 hops, 2 nodes in 0.50 sec ( 30.34 us/hop) 1054606 KB/sec
Note that the latency is some 25 times lower and the bandwidth some 10+ times
higher. Also note that the hostnames are different; don't run multiple copies
on the same node unless you intend to. Running 4 copies on a 4-CPU node
doesn't test infiniband.
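If in doubt, check your machinefile; for this test it should list each node
once, e.g.:

c0-4
c0-5

With that and -np 2 you get one rank per node. Depending on the MPI
implementation, repeating a hostname (or asking for more ranks than you have
nodes) can land several ranks on one box, and then you're mostly measuring
shared memory rather than the interconnect.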
So once you get what you expect, I'd suggest something a bit more
comprehensive. Something like:
mpirun -np <number of nodes> -machinefile <list of nodes> ./mpi_nxnlatbw
I'd expect some difference in latency and bandwidth between node pairs, but
not any big differences. Something like:
[0<->1] 1.85us 1398.825264 (MillionBytes/sec)
[0<->2] 1.75us 1300.812337 (MillionBytes/sec)
[0<->3] 1.76us 1396.205242 (MillionBytes/sec)
[0<->4] 1.68us 1398.647324 (MillionBytes/sec)
[1<->0] 1.82us 1375.550155 (MillionBytes/sec)
[1<->2] 1.69us 1397.936020 (MillionBytes/sec)
...
Once those numbers are consistent and where you expect them to be (both
latency and bandwidth), I'd follow up with a production code that produces a
known answer and is likely to provide much wider MPI coverage.
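If you want to see what an all-pairs test like mpi_nxnlatbw boils down to,
here's a rough sketch of the idea (again just a sketch, not the actual code,
so the output format won't match exactly):

/*
 * nxn_pingpong.c -- for every pair of ranks, time 1-byte ping-pongs for
 * latency and large-message ping-pongs for bandwidth.
 * Compile: mpicc -O3 nxn_pingpong.c -o nxn_pingpong
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LAT_ITERS 1000            /* 1-byte round trips for latency */
#define BW_ITERS  50              /* large-message round trips for bandwidth */
#define BW_BYTES  (1 << 20)       /* 1 MB messages for bandwidth */

/* Both ranks in a pair call this; the lower rank sends first. */
static void pingpong(int me, int peer, char *buf, int bytes, int iters)
{
    int i;
    for (i = 0; i < iters; i++) {
        if (me < peer) {
            MPI_Send(buf, bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
        }
    }
}

int main(int argc, char **argv)
{
    int rank, nprocs, i, j, peer;
    char *buf;
    double t, lat_us, mbytes_sec;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    buf = malloc(BW_BYTES);
    memset(buf, 0, BW_BYTES);

    for (i = 0; i < nprocs; i++) {
        for (j = 0; j < nprocs; j++) {
            if (i == j)
                continue;
            MPI_Barrier(MPI_COMM_WORLD);   /* keep the other ranks idle */
            if (rank != i && rank != j)
                continue;
            peer = (rank == i) ? j : i;

            /* latency: half the average 1-byte round-trip time */
            t = MPI_Wtime();
            pingpong(rank, peer, buf, 1, LAT_ITERS);
            lat_us = (MPI_Wtime() - t) / LAT_ITERS / 2.0 * 1e6;

            /* bandwidth: each round trip moves BW_BYTES in each direction */
            t = MPI_Wtime();
            pingpong(rank, peer, buf, BW_BYTES, BW_ITERS);
            mbytes_sec = 2.0 * BW_BYTES * BW_ITERS / (MPI_Wtime() - t) / 1e6;

            if (rank == i)
                printf("[%d<->%d] %.2fus %f (MillionBytes/sec)\n",
                       i, j, lat_us, mbytes_sec);
        }
    }
    free(buf);
    MPI_Finalize();
    return 0;
}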