[Beowulf] How to know if InfiniBand network works?

Jeff Johnson jeff.johnson at aeoncomputing.com
Wed Aug 2 16:29:05 PDT 2017


Faraz,

You can test your point-to-point RDMA latency and bandwidth as well.

On host lustwz99 run `qperf` (with no arguments it sits and listens as the server).
On any of the hosts lustwzb1-16 run `qperf lustwz99 -t 30 rc_lat rc_bi_bw`
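
If you want to hit all of lustwzb1-16 in one pass, a rough loop like the one
below should do it (just a sketch: it assumes passwordless ssh to the compute
nodes, qperf installed everywhere, and the qperf server already running on
lustwz99):

    # the qperf server must already be running on lustwz99 (plain `qperf`, no args)
    for i in $(seq 1 16); do
        echo "=== lustwzb$i ==="
        ssh "lustwzb$i" qperf lustwz99 -t 30 rc_lat rc_bi_bw
    done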

Establish that you can pass traffic at the expected speeds before moving on to
the IPoIB portion.
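
When you do get to the IPoIB side, qperf can exercise that path too: point it
at whatever address the server carries on its ib0 interface and use the TCP
tests instead of the RC ones (the address below is only a placeholder for
lustwz99's ib0 address):

    # server side, on lustwz99:   qperf
    # client side, from any compute node:
    qperf 10.10.10.99 -t 30 tcp_lat tcp_bw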

Also make sure that all of your nodes are running in the same IPoIB mode
(connected or datagram) and that the MTU is the same on all nodes for that
IP interface.
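
A quick way to eyeball that across the cluster (again just a sketch, assuming
ib0 is the IPoIB interface in question and you can ssh to each node):

    # print the IPoIB mode (datagram or connected) and MTU of ib0 on every node
    for i in $(seq 1 16); do
        echo -n "lustwzb$i: "
        ssh "lustwzb$i" 'echo "$(cat /sys/class/net/ib0/mode) mtu=$(cat /sys/class/net/ib0/mtu)"'
    done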

--Jeff

On Wed, Aug 2, 2017 at 10:50 AM, Faraz Hussain <info at feacluster.com> wrote:

> Thanks Joe. Here is the output from the commands you suggested. We have
> Open MPI built with the Intel compilers. Is there some benchmark code I can
> compile so that we are all comparing the same code?
>
> [hussaif1 at lustwzb4 test]$ ibv_devinfo
> hca_id: mlx4_0
>         transport:                      InfiniBand (0)
>         fw_ver:                         2.11.550
>         node_guid:                      f452:1403:0016:3b70
>         sys_image_guid:                 f452:1403:0016:3b73
>         vendor_id:                      0x02c9
>         vendor_part_id:                 4099
>         hw_ver:                         0x0
>         board_id:                       DEL0A40000028
>         phys_port_cnt:                  2
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                4096 (5)
>                         active_mtu:             4096 (5)
>                         sm_lid:                 1
>                         port_lid:               3
>                         port_lmc:               0x00
>                         link_layer:             InfiniBand
>
>                 port:   2
>                         state:                  PORT_DOWN (1)
>                         max_mtu:                4096 (5)
>                         active_mtu:             4096 (5)
>                         sm_lid:                 0
>                         port_lid:               0
>                         port_lmc:               0x00
>                         link_layer:             InfiniBand
>
>
> [hussaif1 at lustwzb4 test]$ ibstat
> CA 'mlx4_0'
>         CA type: MT4099
>         Number of ports: 2
>         Firmware version: 2.11.550
>         Hardware version: 0
>         Node GUID: 0xf452140300163b70
>         System image GUID: 0xf452140300163b73
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 40 (FDR10)
>                 Base lid: 3
>                 LMC: 0
>                 SM lid: 1
>                 Capability mask: 0x02514868
>                 Port GUID: 0xf452140300163b71
>                 Link layer: InfiniBand
>         Port 2:
>                 State: Down
>                 Physical state: Disabled
>                 Rate: 10
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x02514868
>                 Port GUID: 0xf452140300163b72
>                 Link layer: InfiniBand
>
> [hussaif1 at lustwzb4 test]$ ibstatus
> Infiniband device 'mlx4_0' port 1 status:
>         default gid:     fe80:0000:0000:0000:f452:1403:0016:3b71
>         base lid:        0x3
>         sm lid:          0x1
>         state:           4: ACTIVE
>         phys state:      5: LinkUp
>         rate:            40 Gb/sec (4X FDR10)
>         link_layer:      InfiniBand
>
> Infiniband device 'mlx4_0' port 2 status:
>         default gid:     fe80:0000:0000:0000:f452:1403:0016:3b72
>         base lid:        0x0
>         sm lid:          0x0
>         state:           1: DOWN
>         phys state:      3: Disabled
>         rate:            10 Gb/sec (4X)
>         link_layer:      InfiniBand
>
>
>
> Quoting Joe Landman <joe.landman at gmail.com>:
>
>> start with
>>
>>     ibv_devinfo
>>
>>     ibstat
>>
>>     ibstatus
>>
>>
>> and see what (if anything) they report.
>>
>> Second, how did you compile/run your MPI code?
>>
>>
>> On 08/02/2017 12:44 PM, Faraz Hussain wrote:
>>
>>> I have inherited a 20-node cluster that supposedly has an InfiniBand
>>> network. I am testing some MPI applications and am seeing no performance
>>> improvement with multiple nodes. So I am wondering if the InfiniBand network
>>> even works?
>>>
>>> The output of ifconfig -a shows an ib0 and ib1 interface. I ran ethtool
>>> ib0 and it shows:
>>>
>>>        Speed: 40000Mb/s
>>>        Link detected: no
>>>
>>> and for ib1 it shows:
>>>
>>>        Speed: 10000Mb/s
>>>        Link detected: no
>>>
>>> I assume this means the link is down? Any idea how to debug further and
>>> restart it?
>>>
>>> Thanks!
>>>
>>>
>>
>> --
>> Joe Landman
>> e: joe.landman at gmail.com
>> t: @hpcjoe
>> w: https://scalability.org
>> g: https://github.com/joelandman
>> l: https://www.linkedin.com/in/joelandman
>>
>>
>
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>



-- 
------------------------------
Jeff Johnson
Co-Founder
Aeon Computing

jeff.johnson at aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite D - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage