[Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

Faraz Hussain info at feacluster.com
Wed May 1 07:05:41 PDT 2019


> What hardware and what InfiniBand switch do you have?
> Run these commands: ibdiagnet, smshow

Unfortunately ibdiagnet seems to give some errors:

[hussaif1 at lustwzb34 ~]$ ibdiagnet
----------
Load Plugins from:
/usr/share/ibdiagnet2.1.1/plugins/
(You can specify more paths to be looked in with  
"IBDIAGNET_PLUGINS_PATH" env variable)

Plugin Name                                   Result     Comment
libibdiagnet_cable_diag_plugin-2.1.1          Succeeded  Plugin loaded
libibdiagnet_phy_diag_plugin-2.1.1            Succeeded  Plugin loaded

---------------------------------------------
Discovery
-E- Failed to initialize

-E- Fabric Discover failed, err=IBDiag initialize wasn't done
-E- Fabric Discover failed, MAD err=Failed to umad_open_port

---------------------------------------------
Summary
-I- Stage                     Warnings   Errors     Comment
-I- Discovery                                       NA
-I- Lids Check                                      NA
-I- Links Check                                     NA
-I- Subnet Manager                                  NA
-I- Port Counters                                   NA
-I- Nodes Information                               NA
-I- Speed / Width checks                            NA
-I- Partition Keys                                  NA
-I- Alias GUIDs                                     NA
-I- Temperature Sensing                             NA

-I- You can find detailed errors/warnings in:  
/var/tmp/ibdiagnet2/ibdiagnet2.log

-E- A fatal error occurred, exiting...


I do not have the smshow command, but I see there is an sminfo. It also
gives this error:

[hussaif1 at lustwzb34 ~]$ smshow
bash: smshow: command not found...
[hussaif1 at lustwzb34 ~]$ sm
smartctl     smbcacls     smbcquotas   smbspool     smbtree       
sm-notify    smpdump      smtp-sink
smartd       smbclient    smbget       smbtar       sminfo        
smparquery   smpquery     smtp-source
[hussaif1 at lustwzb34 ~]$ sminfo
ibwarn: [10407] mad_rpc_open_port: can't open UMAD port ((null):0)
sminfo: iberror: failed: Failed to open '(null)' port '0'
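
Both ibdiagnet and sminfo fail at the same point, opening a UMAD port. A
minimal first check, assuming the standard /dev/infiniband device layout
(and noting that ibdiagnet usually wants to be run as root):

   lsmod | grep ib_umad         # is the management-datagram module loaded?
   ls -l /dev/infiniband/       # are umad0/umad1 device nodes present and readable?
   sudo ibdiagnet               # retry with root privileges

If the ports are actually running with an Ethernet link layer, as the
ibv_devinfo output quoted further down suggests, there is no InfiniBand
subnet manager to talk to, so sminfo and ibdiagnet are not expected to
work on them anyway.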



> You originally had the OpenMPI which was provided by CentOS  ??

Correct.

> You compiled the OpenMPI from source??

Yes, I then compiled it from source and it seems to work (at least it
gives reasonable numbers when running the latency and bandwidth tests).

> How are you bringing the new OpenMPI version into your PATH? Are you
> using modules or an mpi switcher utility?

Just as follows:

export PATH=/Apps/users/hussaif1/openmpi-4.0.0/bin:$PATH
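
(A small sketch of what a fuller version of that setup might look like,
assuming the default bin/ and lib/ layout under that prefix; the point is
to make sure the matching libraries and the intended mpirun are the ones
actually picked up:

   export PATH=/Apps/users/hussaif1/openmpi-4.0.0/bin:$PATH
   export LD_LIBRARY_PATH=/Apps/users/hussaif1/openmpi-4.0.0/lib:$LD_LIBRARY_PATH
   which mpirun                  # should resolve into openmpi-4.0.0/bin
   ompi_info | grep 'MCA btl'    # lists the btl components this build ships with

so that the system-provided Open MPI does not sneak back in.)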

Thanks!

>
> On Wed, 1 May 2019 at 09:39, Benson Muite <benson_muite at emailplus.org>
> wrote:
>
>> Hi Faraz,
>>
>> Have you tried any other MPI distributions (eg. MPICH, MVAPICH)?
>>
>> Regards,
>>
>> Benson
>> On 4/30/19 11:20 PM, Gus Correa wrote:
>>
>> It may be using IPoIB (TCP/IP over IB), not verbs/rdma.
>> You can force it to use openib (verbs, rdma) as shown below (vader is the
>> in-node shared-memory btl):
>>
>> mpirun --mca btl openib,self,vader ...
>>
>>
>> These flags may also help tell which btl (byte transport layer) is  
>> being used:
>>
>>  --mca btl_base_verbose 30
>>
>> See these FAQ entries:
>> https://www.open-mpi.org/faq/?category=openfabrics#ib-btl
>> https://www.open-mpi.org/faq/?category=all#tcp-routability-1.3
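
Putting those two suggestions together with the OSU benchmark that appears
further down, a trial invocation might look roughly like this (the hostfile
and binary names are simply the ones from the runs below, so adjust to
taste):

   mpirun --mca btl openib,self,vader --mca btl_base_verbose 30 \
          -np 2 -hostfile ./hostfile ./osu_latency

The verbose output should then say which btl each pair of ranks actually
selects.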
>>
>> Better really ask more details in the Open MPI list. They are the pros!
>>
>> My two cents,
>> Gus Correa
>>
>>
>>
>> On Tue, Apr 30, 2019 at 3:57 PM Faraz Hussain <info at feacluster.com> wrote:
>>
>>> Thanks, after building Open MPI 4 from source, it now works! However, it
>>> still gives this message below when I run Open MPI with the verbose setting:
>>>
>>> No OpenFabrics connection schemes reported that they were able to be
>>> used on a specific port.  As such, the openib BTL (OpenFabrics
>>> support) will be disabled for this port.
>>>
>>>    Local host:           lustwzb34
>>>    Local device:         mlx4_0
>>>    Local port:           1
>>>    CPCs attempted:       rdmacm, udcm
>>>
>>> However, the results from my latency and bandwidth tests seem to be
>>> what I would expect from infiniband. See:
>>>
>>> [hussaif1 at lustwzb34 pt2pt]$  mpirun -v -np 2 -hostfile ./hostfile
>>> ./osu_latency
>>> # OSU MPI Latency Test v5.3.2
>>> # Size          Latency (us)
>>> 0                       1.87
>>> 1                       1.88
>>> 2                       1.93
>>> 4                       1.92
>>> 8                       1.93
>>> 16                      1.95
>>> 32                      1.93
>>> 64                      2.08
>>> 128                     2.61
>>> 256                     2.72
>>> 512                     2.93
>>> 1024                    3.33
>>> 2048                    3.81
>>> 4096                    4.71
>>> 8192                    6.68
>>> 16384                   8.38
>>> 32768                  12.13
>>> 65536                  19.74
>>> 131072                 35.08
>>> 262144                 64.67
>>> 524288                122.11
>>> 1048576               236.69
>>> 2097152               465.97
>>> 4194304               926.31
>>>
>>> [hussaif1 at lustwzb34 pt2pt]$  mpirun -v -np 2 -hostfile ./hostfile
>>> ./osu_bw
>>> # OSU MPI Bandwidth Test v5.3.2
>>> # Size      Bandwidth (MB/s)
>>> 1                       3.09
>>> 2                       6.35
>>> 4                      12.77
>>> 8                      26.01
>>> 16                     51.31
>>> 32                    103.08
>>> 64                    197.89
>>> 128                   362.00
>>> 256                   676.28
>>> 512                  1096.26
>>> 1024                 1819.25
>>> 2048                 2551.41
>>> 4096                 3886.63
>>> 8192                 3983.17
>>> 16384                4362.30
>>> 32768                4457.09
>>> 65536                4502.41
>>> 131072               4512.64
>>> 262144               4531.48
>>> 524288               4537.42
>>> 1048576              4510.69
>>> 2097152              4546.64
>>> 4194304              4565.12
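
(A rough cross-check, not from the thread itself: rerunning the same
benchmark with the TCP btl forced should show whether the numbers above
depend on RDMA at all:

   mpirun --mca btl tcp,self,vader -np 2 -hostfile ./hostfile ./osu_bw

If the openib path really is disabled, as the warning above says, the two
runs should come out similar.)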
>>>
>>> When I run ibv_devinfo I get:
>>>
>>> [hussaif1 at lustwzb34 pt2pt]$ ibv_devinfo
>>> hca_id: mlx4_0
>>>          transport:                      InfiniBand (0)
>>>          fw_ver:                         2.36.5000
>>>          node_guid:                      480f:cfff:fff5:c6c0
>>>          sys_image_guid:                 480f:cfff:fff5:c6c3
>>>          vendor_id:                      0x02c9
>>>          vendor_part_id:                 4103
>>>          hw_ver:                         0x0
>>>          board_id:                       HP_1360110017
>>>          phys_port_cnt:                  2
>>>          Device ports:
>>>                  port:   1
>>>                          state:                  PORT_ACTIVE (4)
>>>                          max_mtu:                4096 (5)
>>>                          active_mtu:             1024 (3)
>>>                          sm_lid:                 0
>>>                          port_lid:               0
>>>                          port_lmc:               0x00
>>>                          link_layer:             Ethernet
>>>
>>>                  port:   2
>>>                          state:                  PORT_DOWN (1)
>>>                          max_mtu:                4096 (5)
>>>                          active_mtu:             1024 (3)
>>>                          sm_lid:                 0
>>>                          port_lid:               0
>>>                          port_lmc:               0x00
>>>                          link_layer:             Ethernet
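
(Note that link_layer: Ethernet on both ports means this adapter is running
Ethernet/RoCE rather than native InfiniBand, which fits both the
umad/sminfo failures earlier and the CPC message above: udcm generally does
not work over RoCE, so the rdmacm connection manager is the one to use, and
it needs an IP address configured on the port's Ethernet interface. A
hedged thing to try, reusing the flags suggested earlier in the thread:

   mpirun --mca btl openib,self,vader \
          --mca btl_openib_cpc_include rdmacm \
          -np 2 -hostfile ./hostfile ./osu_latency

The Open MPI list is the right place to confirm the details.)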
>>>
>>> I will ask the Open MPI mailing list whether my results make sense.
>>>
>>>
>>> Quoting Gus Correa <gus at ldeo.columbia.edu>:
>>>
>>> > Hi Faraz
>>> >
>>> > By all means, download the Open MPI tarball and build from source.
>>> > Otherwise there won't be support for IB (the CentOS Open MPI packages
>>> > most likely rely only on TCP/IP).
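
(For what it's worth, a possible configure line for such a build, reusing
the install prefix mentioned at the top of this reply; --with-verbs asks
for the openib/verbs support explicitly, although configure will normally
auto-detect it when the libibverbs headers are installed, so check the
README rather than taking this as-is:

   ./configure --prefix=/Apps/users/hussaif1/openmpi-4.0.0 --with-verbs
   make -j 8 && make install     # adjust -j to the number of cores available
)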
>>> >
>>> > Read their README file (it comes in the tarball), and take a careful
>>> > look at their (excellent) FAQ:
>>> > https://www.open-mpi.org/faq/
>>> > Many issues can be solved by just reading these two resources.
>>> >
>>> > If you hit more trouble, subscribe to the Open MPI mailing list, and ask
>>> > questions there,
>>> > because you will get advice directly from the Open MPI developers, and
>>> > the fix will come easy.
>>> > https://www.open-mpi.org/community/lists/ompi.php
>>> >
>>> > My two cents,
>>> > Gus Correa
>>> >
>>> > On Tue, Apr 30, 2019 at 3:07 PM Faraz Hussain <info at feacluster.com>
>>> > wrote:
>>> >
>>> >> Thanks, yes I have installed those libraries. See below. Initially I
>>> >> installed the libraries via yum. But then I tried installing the rpms
>>> >> directly from Mellanox website (
>>> >> MLNX_OFED_LINUX-4.5-1.0.1.0-rhel7.5-x86_64.tar ). Even after doing
>>> >> that, I still got the same error with openmpi. I will try your
>>> >> suggestion of building openmpi from source next!
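
(A quick way to double-check which stack ended up active after mixing the
yum packages and the Mellanox rpms, using tools that ship with MLNX_OFED:

   ofed_info -s                               # prints the installed MLNX_OFED version
   rpm -qa | grep -iE 'openmpi|libibverbs'    # which rpms are actually installed
)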
>>> >>
>>> >> root at lustwzb34:/root # yum list | grep ibverbs
>>> >> libibverbs.x86_64                     41mlnx1-OFED.4.5.0.1.0.45101
>>> >> libibverbs-devel.x86_64               41mlnx1-OFED.4.5.0.1.0.45101
>>> >> libibverbs-devel-static.x86_64        41mlnx1-OFED.4.5.0.1.0.45101
>>> >> libibverbs-utils.x86_64               41mlnx1-OFED.4.5.0.1.0.45101
>>> >> libibverbs.i686                       17.2-3.el7
>>> >> rhel-7-server-rpms
>>> >> libibverbs-devel.i686                 1.2.1-1.el7
>>> >> rhel-7-server-rpms
>>> >>
>>> >> root at lustwzb34:/root # lsmod | grep ib
>>> >> ib_ucm                 22602  0
>>> >> ib_ipoib              168425  0
>>> >> ib_cm                  53141  3 rdma_cm,ib_ucm,ib_ipoib
>>> >> ib_umad                22093  0
>>> >> mlx5_ib               339961  0
>>> >> ib_uverbs             121821  3 mlx5_ib,ib_ucm,rdma_ucm
>>> >> mlx5_core             919178  2 mlx5_ib,mlx5_fpga_tools
>>> >> mlx4_ib               211747  0
>>> >> ib_core               294554  10 rdma_cm,ib_cm,iw_cm,mlx4_ib,mlx5_ib,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib
>>> >> mlx4_core             360598  2 mlx4_en,mlx4_ib
>>> >> mlx_compat             29012  15 rdma_cm,ib_cm,iw_cm,mlx4_en,mlx4_ib,mlx5_ib,mlx5_fpga_tools,ib_ucm,ib_core,ib_umad,ib_uverbs,mlx4_core,mlx5_core,rdma_ucm,ib_ipoib
>>> >> devlink                42368  4 mlx4_en,mlx4_ib,mlx4_core,mlx5_core
>>> >> libcrc32c              12644  3 xfs,nf_nat,nf_conntrack
>>> >> root at lustwzb34:/root #
>>> >>
>>> >>
>>> >>
>>> >> > Did you install libibverbs (and libibverbs-utils, for information
>>> >> > and troubleshooting)?
>>> >>
>>> >> > yum list |grep ibverbs
>>> >>
>>> >> > Are you loading the ib modules?
>>> >>
>>> >> > lsmod |grep ib
>>> >>
>>> >>
>>>
>>>
>>>
>>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit  
>> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>>




