[Beowulf] How to know if infiniband network works?
Faraz Hussain
info at feacluster.com
Thu Aug 3 06:21:16 PDT 2017
I ran the qperf command between two compute nodes (b4 and b5) and got:
[hussaif1@lustwzb5 ~]$ qperf lustwzb4 -t 30 rc_lat rc_bi_bw
rc_lat:
    latency  =  7.73 us
rc_bi_bw:
    bw  =  9.06 GB/sec
If I understand correctly, I would need to enable IPoIB and then rerun the
test? It would then show ~40 Gb/sec, I assume.
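For what it's worth, 40 Gb/sec is roughly 5 GB/sec per direction, so a
bidirectional RDMA figure of ~9 GB/sec is already close to line rate for
this link; IPoIB throughput is normally lower than raw RDMA, not higher.
A minimal sketch for testing the IPoIB path separately, assuming ib0 on
lustwzb4 has an address (the address below is a placeholder) and `qperf`
is still running as a server on lustwzb4:

    # on lustwzb5, point qperf at lustwzb4's ib0 address to force IPoIB
    qperf <ib0-address-of-lustwzb4> -t 30 tcp_lat tcp_bw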
Quoting Jeff Johnson <jeff.johnson at aeoncomputing.com>:
> Faraz,
>
> You can test your point-to-point RDMA bandwidth as well.
>
> On host lustwz99 run `qperf`
> On any of the hosts lustwzb1-16 run `qperf lustwz99 -t 30 rc_lat rc_bi_bw`
>
> Establish that you can pass traffic at expected speeds before going to the
> IPoIB portion.
>
> Also make sure that all of your nodes are running in the same mode,
> connected or datagram, and that your MTU is the same on all nodes for that
> IP interface.
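A quick way to check both on each node, assuming ib0 is the IPoIB
interface in use here:

    cat /sys/class/net/ib0/mode   # prints "connected" or "datagram"
    ip link show ib0              # the MTU is reported on the first line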
>
> --Jeff
>
> On Wed, Aug 2, 2017 at 10:50 AM, Faraz Hussain <info at feacluster.com> wrote:
>
>> Thanks Joe. Here is the output from the commands you suggested. We have
>> Open MPI built with the Intel compilers. Is there some benchmark code I can
>> compile so that we are all comparing the same code?
>>
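One common choice is the OSU Micro-Benchmarks (osu_latency and osu_bw).
A rough sketch of building and running them with Open MPI; the version
number, download URL and binary paths below are examples and may differ:

    wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.3.2.tar.gz
    tar xzf osu-micro-benchmarks-5.3.2.tar.gz && cd osu-micro-benchmarks-5.3.2
    ./configure CC=mpicc CXX=mpicxx && make

    # point-to-point latency and bandwidth between the two nodes
    mpirun -np 2 -host lustwzb4,lustwzb5 ./mpi/pt2pt/osu_latency
    mpirun -np 2 -host lustwzb4,lustwzb5 ./mpi/pt2pt/osu_bw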
>> [hussaif1@lustwzb4 test]$ ibv_devinfo
>> hca_id: mlx4_0
>>     transport:          InfiniBand (0)
>>     fw_ver:             2.11.550
>>     node_guid:          f452:1403:0016:3b70
>>     sys_image_guid:     f452:1403:0016:3b73
>>     vendor_id:          0x02c9
>>     vendor_part_id:     4099
>>     hw_ver:             0x0
>>     board_id:           DEL0A40000028
>>     phys_port_cnt:      2
>>         port:   1
>>             state:          PORT_ACTIVE (4)
>>             max_mtu:        4096 (5)
>>             active_mtu:     4096 (5)
>>             sm_lid:         1
>>             port_lid:       3
>>             port_lmc:       0x00
>>             link_layer:     InfiniBand
>>
>>         port:   2
>>             state:          PORT_DOWN (1)
>>             max_mtu:        4096 (5)
>>             active_mtu:     4096 (5)
>>             sm_lid:         0
>>             port_lid:       0
>>             port_lmc:       0x00
>>             link_layer:     InfiniBand
>>
>>
>> [hussaif1@lustwzb4 test]$ ibstat
>> CA 'mlx4_0'
>>     CA type: MT4099
>>     Number of ports: 2
>>     Firmware version: 2.11.550
>>     Hardware version: 0
>>     Node GUID: 0xf452140300163b70
>>     System image GUID: 0xf452140300163b73
>>     Port 1:
>>         State: Active
>>         Physical state: LinkUp
>>         Rate: 40 (FDR10)
>>         Base lid: 3
>>         LMC: 0
>>         SM lid: 1
>>         Capability mask: 0x02514868
>>         Port GUID: 0xf452140300163b71
>>         Link layer: InfiniBand
>>     Port 2:
>>         State: Down
>>         Physical state: Disabled
>>         Rate: 10
>>         Base lid: 0
>>         LMC: 0
>>         SM lid: 0
>>         Capability mask: 0x02514868
>>         Port GUID: 0xf452140300163b72
>>         Link layer: InfiniBand
>>
>> [hussaif1@lustwzb4 test]$ ibstatus
>> Infiniband device 'mlx4_0' port 1 status:
>>     default gid:     fe80:0000:0000:0000:f452:1403:0016:3b71
>>     base lid:        0x3
>>     sm lid:          0x1
>>     state:           4: ACTIVE
>>     phys state:      5: LinkUp
>>     rate:            40 Gb/sec (4X FDR10)
>>     link_layer:      InfiniBand
>>
>> Infiniband device 'mlx4_0' port 2 status:
>>     default gid:     fe80:0000:0000:0000:f452:1403:0016:3b72
>>     base lid:        0x0
>>     sm lid:          0x0
>>     state:           1: DOWN
>>     phys state:      3: Disabled
>>     rate:            10 Gb/sec (4X)
>>     link_layer:      InfiniBand
>>
>>
>>
>> Quoting Joe Landman <joe.landman at gmail.com>:
>>
>>> start with
>>>
>>> ibv_devinfo
>>>
>>> ibstat
>>>
>>> ibstatus
>>>
>>>
>>> and see what (if anything) they report.
>>>
>>> Second, how did you compile/run your MPI code?
>>>
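If the question is whether the runs are actually using InfiniBand rather
than falling back to TCP, one way to check with Open MPI (the openib BTL
settings below apply to Open MPI 1.x/2.x, and ./a.out is a placeholder
for the application):

    ompi_info | grep -i openib      # was the openib BTL built in?
    mpirun -np 2 -host lustwzb4,lustwzb5 --mca btl openib,self \
           --mca btl_base_verbose 30 ./a.out
    # restricting btl to openib,self makes the run fail loudly if IB cannot be used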
>>>
>>> On 08/02/2017 12:44 PM, Faraz Hussain wrote:
>>>
>>>> I have inherited a 20-node cluster that supposedly has an InfiniBand
>>>> network. I am testing some MPI applications and am seeing no performance
>>>> improvement with multiple nodes. So I am wondering if the InfiniBand network
>>>> even works?
>>>>
>>>> The output of ifconfig -a shows ib0 and ib1 interfaces. I ran ethtool
>>>> ib0 and it shows:
>>>>
>>>> Speed: 40000Mb/s
>>>> Link detected: no
>>>>
>>>> and for ib1 it shows:
>>>>
>>>> Speed: 10000Mb/s
>>>> Link detected: no
>>>>
>>>> I am assuming this means it is down? Any idea how to debug further and
>>>> restart it?
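A minimal sketch of first checks on the IP side, assuming ib0 is the
IPoIB interface and the infiniband-diags package is installed (run as
root where needed):

    ip link show ib0      # is the interface administratively up?
    ip link set ib0 up    # if it shows DOWN, bring it up
    sminfo                # is a subnet manager answering on the fabric?

A "Link detected: no" from ethtool can simply mean the ib0 interface has
not been brought up; the IB-level tools (ibv_devinfo, ibstat, ibstatus)
show the state of the link itself.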
>>>>
>>>> Thanks!
>>>>
>>>
>>> --
>>> Joe Landman
>>> e: joe.landman at gmail.com
>>> t: @hpcjoe
>>> w: https://scalability.org
>>> g: https://github.com/joelandman
>>> l: https://www.linkedin.com/in/joelandman
>>>
>>
>
>
>
> --
> ------------------------------
> Jeff Johnson
> Co-Founder
> Aeon Computing
>
> jeff.johnson at aeoncomputing.com
> www.aeoncomputing.com
> t: 858-412-3810 x1001 f: 858-412-3845
> m: 619-204-9061
>
> 4170 Morena Boulevard, Suite D - San Diego, CA 92117
>
> High-Performance Computing / Lustre Filesystems / Scale-out Storage