[Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

John Hearns hearnsj at googlemail.com
Wed May 1 07:40:03 PDT 2019


I think I was on the wrong track regarding the subnet manager, sorry.
What does   ibstatus   give you?
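
A rough sketch of what I would run to see what the ports are actually doing
(check the device name on your system - mlx4_0 is what your ibv_devinfo shows):

   ibstatus
   ibstat mlx4_0
   ibv_devinfo | grep -E 'state|link_layer'

The state and link_layer lines tell you whether each port is up and whether it
is configured as InfiniBand or Ethernet.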

On Wed, 1 May 2019 at 15:31, John Hearns <hearnsj at googlemail.com> wrote:

> Errrr..   you are not running a subnet manager?
> Do you have an InfiniBand switch, or are you connecting two servers
> back-to-back?
>
> Also - have you considered using OpenHPC rather than installing CentOS on
> two servers?
> When you expand, this manual installation is going to be painful.
>
> On Wed, 1 May 2019 at 15:05, Faraz Hussain <info at feacluster.com> wrote:
>
>> > What hardware and what Infiniband switch you have
>> > Run   these commands:      ibdiagnet   smshow
>>
>> Unfortunately ibdiagnet seems to give some errors:
>>
>> [hussaif1 at lustwzb34 ~]$ ibdiagnet
>> ----------
>> Load Plugins from:
>> /usr/share/ibdiagnet2.1.1/plugins/
>> (You can specify more paths to be looked in with
>> "IBDIAGNET_PLUGINS_PATH" env variable)
>>
>> Plugin Name                                   Result     Comment
>> libibdiagnet_cable_diag_plugin-2.1.1          Succeeded  Plugin loaded
>> libibdiagnet_phy_diag_plugin-2.1.1            Succeeded  Plugin loaded
>>
>> ---------------------------------------------
>> Discovery
>> -E- Failed to initialize
>>
>> -E- Fabric Discover failed, err=IBDiag initialize wasn't done
>> -E- Fabric Discover failed, MAD err=Failed to umad_open_port
>>
>> ---------------------------------------------
>> Summary
>> -I- Stage                     Warnings   Errors     Comment
>> -I- Discovery                                       NA
>> -I- Lids Check                                      NA
>> -I- Links Check                                     NA
>> -I- Subnet Manager                                  NA
>> -I- Port Counters                                   NA
>> -I- Nodes Information                               NA
>> -I- Speed / Width checks                            NA
>> -I- Partition Keys                                  NA
>> -I- Alias GUIDs                                     NA
>> -I- Temperature Sensing                             NA
>>
>> -I- You can find detailed errors/warnings in:
>> /var/tmp/ibdiagnet2/ibdiagnet2.log
>>
>> -E- A fatal error occurred, exiting...
>>
>>
>> I do not have an smshow command, but I see there is an sminfo. It also
>> gives this error:
>>
>> [hussaif1 at lustwzb34 ~]$ smshow
>> bash: smshow: command not found...
>> [hussaif1 at lustwzb34 ~]$ sm
>> smartctl     smbcacls     smbcquotas   smbspool     smbtree
>> sm-notify    smpdump      smtp-sink
>> smartd       smbclient    smbget       smbtar       sminfo
>> smparquery   smpquery     smtp-source
>> [hussaif1 at lustwzb34 ~]$ sminfo
>> ibwarn: [10407] mad_rpc_open_port: can't open UMAD port ((null):0)
>> sminfo: iberror: failed: Failed to open '(null)' port '0'
>>
>>
>>
>> > You originally had the OpenMPI which was provided by CentOS  ??
>>
>> Correct.
>>
>> > You compiled the OpenMPI from source??
>>
>> Yes, I then compiled it from source and it seems to work (at least it
>> gives reasonable numbers when running latency and bandwidth tests).
>>
>> > How are you bringing the new OpenMPI version into your PATH ?? Are you
>> > using modules or an mpi switcher utility?
>>
>> Just as follows:
>>
>> export PATH=/Apps/users/hussaif1/openmpi-4.0.0/bin:$PATH
>>
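>> If the run-time linker complains about missing Open MPI libraries, I would
>> also prepend the matching lib directory (assuming the default lib/
>> subdirectory of that prefix):
>>
>> export LD_LIBRARY_PATH=/Apps/users/hussaif1/openmpi-4.0.0/lib:$LD_LIBRARY_PATH
>>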
>> Thanks!
>>
>> >
>> > On Wed, 1 May 2019 at 09:39, Benson Muite <benson_muite at emailplus.org>
>> > wrote:
>> >
>> >> Hi Faraz,
>> >>
>> >> Have you tried any other MPI distributions (e.g. MPICH, MVAPICH)?
>> >>
>> >> Regards,
>> >>
>> >> Benson
>> >> On 4/30/19 11:20 PM, Gus Correa wrote:
>> >>
>> >> It may be using IPoIB (TCP/IP over IB), not verbs/rdma.
>> >> You can force it to use openib (verbs, rdma) with the following (vader is
>> >> the in-node shared-memory BTL):
>> >>
>> >> mpirun --mca btl openib,self,vader ...
>> >>
>> >>
>> >> This flag may also help show which BTL (byte transfer layer) is
>> >> being used:
>> >>
>> >>  --mca btl_base_verbose 30
>> >>
>> >> See these FAQ:
>> >> https://www.open-mpi.org/faq/?category=openfabrics#ib-btl
>> >> https://www.open-mpi.org/faq/?category=all#tcp-routability-1.3
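>> >>
>> >> For example, a combined run might look like this (a sketch only, reusing
>> >> the hostfile and OSU benchmark from the message below):
>> >>
>> >> mpirun --mca btl openib,self,vader --mca btl_base_verbose 30 \
>> >>     -np 2 -hostfile ./hostfile ./osu_latency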
>> >>
>> >> Better to ask for more details on the Open MPI list. They are the pros!
>> >>
>> >> My two cents,
>> >> Gus Correa
>> >>
>> >>
>> >>
>> >> On Tue, Apr 30, 2019 at 3:57 PM Faraz Hussain <info at feacluster.com>
>> wrote:
>> >>
>> >>> Thanks, after building openmpi 4 from source, it now works! However it
>> >>> still gives this message below when I run openmpi with the verbose
>> >>> setting:
>> >>>
>> >>> No OpenFabrics connection schemes reported that they were able to be
>> >>> used on a specific port.  As such, the openib BTL (OpenFabrics
>> >>> support) will be disabled for this port.
>> >>>
>> >>>    Local host:           lustwzb34
>> >>>    Local device:         mlx4_0
>> >>>    Local port:           1
>> >>>    CPCs attempted:       rdmacm, udcm
>> >>>
>> >>> However, the results from my latency and bandwidth tests seem to be
>> >>> what I would expect from InfiniBand. See:
>> >>>
>> >>> [hussaif1 at lustwzb34 pt2pt]$  mpirun -v -np 2 -hostfile ./hostfile
>> >>> ./osu_latency
>> >>> # OSU MPI Latency Test v5.3.2
>> >>> # Size          Latency (us)
>> >>> 0                       1.87
>> >>> 1                       1.88
>> >>> 2                       1.93
>> >>> 4                       1.92
>> >>> 8                       1.93
>> >>> 16                      1.95
>> >>> 32                      1.93
>> >>> 64                      2.08
>> >>> 128                     2.61
>> >>> 256                     2.72
>> >>> 512                     2.93
>> >>> 1024                    3.33
>> >>> 2048                    3.81
>> >>> 4096                    4.71
>> >>> 8192                    6.68
>> >>> 16384                   8.38
>> >>> 32768                  12.13
>> >>> 65536                  19.74
>> >>> 131072                 35.08
>> >>> 262144                 64.67
>> >>> 524288                122.11
>> >>> 1048576               236.69
>> >>> 2097152               465.97
>> >>> 4194304               926.31
>> >>>
>> >>> [hussaif1 at lustwzb34 pt2pt]$  mpirun -v -np 2 -hostfile ./hostfile
>> >>> ./osu_bw
>> >>> # OSU MPI Bandwidth Test v5.3.2
>> >>> # Size      Bandwidth (MB/s)
>> >>> 1                       3.09
>> >>> 2                       6.35
>> >>> 4                      12.77
>> >>> 8                      26.01
>> >>> 16                     51.31
>> >>> 32                    103.08
>> >>> 64                    197.89
>> >>> 128                   362.00
>> >>> 256                   676.28
>> >>> 512                  1096.26
>> >>> 1024                 1819.25
>> >>> 2048                 2551.41
>> >>> 4096                 3886.63
>> >>> 8192                 3983.17
>> >>> 16384                4362.30
>> >>> 32768                4457.09
>> >>> 65536                4502.41
>> >>> 131072               4512.64
>> >>> 262144               4531.48
>> >>> 524288               4537.42
>> >>> 1048576              4510.69
>> >>> 2097152              4546.64
>> >>> 4194304              4565.12
>> >>>
>> >>> When I run ibv_devinfo I get:
>> >>>
>> >>> [hussaif1 at lustwzb34 pt2pt]$ ibv_devinfo
>> >>> hca_id: mlx4_0
>> >>>          transport:                      InfiniBand (0)
>> >>>          fw_ver:                         2.36.5000
>> >>>          node_guid:                      480f:cfff:fff5:c6c0
>> >>>          sys_image_guid:                 480f:cfff:fff5:c6c3
>> >>>          vendor_id:                      0x02c9
>> >>>          vendor_part_id:                 4103
>> >>>          hw_ver:                         0x0
>> >>>          board_id:                       HP_1360110017
>> >>>          phys_port_cnt:                  2
>> >>>          Device ports:
>> >>>                  port:   1
>> >>>                          state:                  PORT_ACTIVE (4)
>> >>>                          max_mtu:                4096 (5)
>> >>>                          active_mtu:             1024 (3)
>> >>>                          sm_lid:                 0
>> >>>                          port_lid:               0
>> >>>                          port_lmc:               0x00
>> >>>                          link_layer:             Ethernet
>> >>>
>> >>>                  port:   2
>> >>>                          state:                  PORT_DOWN (1)
>> >>>                          max_mtu:                4096 (5)
>> >>>                          active_mtu:             1024 (3)
>> >>>                          sm_lid:                 0
>> >>>                          port_lid:               0
>> >>>                          port_lmc:               0x00
>> >>>                          link_layer:             Ethernet
>> >>>
>> >>> I will ask the openmpi mailing list whether my results make sense.
>> >>>
>> >>>
>> >>> Quoting Gus Correa <gus at ldeo.columbia.edu>:
>> >>>
>> >>> > Hi Faraz
>> >>> >
>> >>> > By all means, download the Open MPI tarball and build from source.
>> >>> > Otherwise there won't be support for IB (the CentOS Open MPI packages
>> >>> > most likely rely only on TCP/IP).
>> >>> >
>> >>> > Read their README file (it comes in the tarball), and take a careful
>> >>> > look at their (excellent) FAQ:
>> >>> > https://www.open-mpi.org/faq/
>> >>> > Many issues can be solved by just reading these two resources.
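>> >>> >
>> >>> > The build itself is usually just the standard sequence - a rough sketch
>> >>> > (pick your own install prefix, and check configure's summary and
>> >>> > ompi_info afterwards to confirm that openib/verbs support was built):
>> >>> >
>> >>> >   tar xzf openmpi-4.0.0.tar.gz
>> >>> >   cd openmpi-4.0.0
>> >>> >   ./configure --prefix=$HOME/sw/openmpi-4.0.0    # adjust prefix
>> >>> >   make -j 8 all && make install
>> >>> >   $HOME/sw/openmpi-4.0.0/bin/ompi_info | grep btl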
>> >>> >
>> >>> > If you hit more trouble, subscribe to the Open MPI mailing list, and
>> >>> > ask questions there, because you will get advice directly from the
>> >>> > Open MPI developers, and the fix will come easy.
>> >>> > https://www.open-mpi.org/community/lists/ompi.php
>> >>> >
>> >>> > My two cents,
>> >>> > Gus Correa
>> >>> >
>> >>> > On Tue, Apr 30, 2019 at 3:07 PM Faraz Hussain <info at feacluster.com>
>> >>> wrote:
>> >>> >
>> >>> >> Thanks, yes I have installed those libraries. See below. Initially I
>> >>> >> installed the libraries via yum. But then I tried installing the rpms
>> >>> >> directly from the Mellanox website
>> >>> >> ( MLNX_OFED_LINUX-4.5-1.0.1.0-rhel7.5-x86_64.tar ). Even after doing
>> >>> >> that, I still got the same error with openmpi. I will try your
>> >>> >> suggestion of building openmpi from source next!
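>> >>> >>
>> >>> >> As a sanity check (assuming the MLNX_OFED tools were installed along
>> >>> >> with those rpms), the installed stack version can be confirmed with:
>> >>> >>
>> >>> >> ofed_info -s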
>> >>> >>
>> >>> >> root at lustwzb34:/root # yum list | grep ibverbs
>> >>> >> libibverbs.x86_64                     41mlnx1-OFED.4.5.0.1.0.45101
>> >>> >> libibverbs-devel.x86_64               41mlnx1-OFED.4.5.0.1.0.45101
>> >>> >> libibverbs-devel-static.x86_64        41mlnx1-OFED.4.5.0.1.0.45101
>> >>> >> libibverbs-utils.x86_64               41mlnx1-OFED.4.5.0.1.0.45101
>> >>> >> libibverbs.i686                       17.2-3.el7
>> >>> >> rhel-7-server-rpms
>> >>> >> libibverbs-devel.i686                 1.2.1-1.el7
>> >>> >> rhel-7-server-rpms
>> >>> >>
>> >>> >> root at lustwzb34:/root # lsmod | grep ib
>> >>> >> ib_ucm                 22602  0
>> >>> >> ib_ipoib              168425  0
>> >>> >> ib_cm                  53141  3 rdma_cm,ib_ucm,ib_ipoib
>> >>> >> ib_umad                22093  0
>> >>> >> mlx5_ib               339961  0
>> >>> >> ib_uverbs             121821  3 mlx5_ib,ib_ucm,rdma_ucm
>> >>> >> mlx5_core             919178  2 mlx5_ib,mlx5_fpga_tools
>> >>> >> mlx4_ib               211747  0
>> >>> >> ib_core               294554  10
>> >>> >> rdma_cm,ib_cm,iw_cm,mlx4_ib,mlx5_ib,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib
>> >>> >> mlx4_core             360598  2 mlx4_en,mlx4_ib
>> >>> >> mlx_compat             29012  15
>> >>> >> rdma_cm,ib_cm,iw_cm,mlx4_en,mlx4_ib,mlx5_ib,mlx5_fpga_tools,ib_ucm,ib_core,ib_umad,ib_uverbs,mlx4_core,mlx5_core,rdma_ucm,ib_ipoib
>> >>> >> devlink                42368  4 mlx4_en,mlx4_ib,mlx4_core,mlx5_core
>> >>> >> libcrc32c              12644  3 xfs,nf_nat,nf_conntrack
>> >>> >> root at lustwzb34:/root #
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> > Did you install libibverbs (and libibverbs-utils, for information
>> >>> >> > and troubleshooting)?
>> >>> >>
>> >>> >> > yum list |grep ibverbs
>> >>> >>
>> >>> >> > Are you loading the ib modules?
>> >>> >>
>> >>> >> > lsmod |grep ib
>> >>> >>
>> >>> >>
>> >>>
>> >>>
>> >>>
>> >>>
>> >> _______________________________________________
>> >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
>> Computing
>> >> To change your subscription (digest mode or unsubscribe) visit
>> >> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>> >>
>>
>>
>>
>>

