[Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?
John Hearns
hearnsj at googlemail.com
Wed May 1 02:10:30 PDT 2019
Hi Faraz. Could you make another summary for us?
What hardware and which InfiniBand switch do you have?
Run these commands: ibdiagnet and sminfo (see the sketch below).
Did you originally have the Open MPI that was provided by CentOS?
Did you then compile Open MPI from source?
How are you bringing the new Open MPI version into your PATH? Are you
using modules or an MPI switcher utility?
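
A sketch of those diagnostics, assuming the standard infiniband-diags and
ibutils tools from the OFED install are present (the module name is only
an example):

    ibstat                       # per-port state, rate, and link layer
    sminfo                       # which subnet manager, if any, is active
    ibdiagnet                    # full fabric sweep; check its log for errors
    module load mpi/openmpi-4.0  # or however your site switches MPI stacks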
On Wed, 1 May 2019 at 09:39, Benson Muite <benson_muite at emailplus.org>
wrote:
> Hi Faraz,
>
> Have you tried any other MPI distributions (e.g. MPICH, MVAPICH)?
>
> Regards,
>
> Benson
> On 4/30/19 11:20 PM, Gus Correa wrote:
>
> It may be using IPoIB (TCP/IP over IB), not verbs/RDMA.
> You can force it to use openib (verbs, RDMA) with the command below
> (vader is the in-node shared-memory transport):
>
> mpirun --mca btl openib,self,vader ...
>
>
> This flag may also help show which BTL (byte transfer layer) is being used:
>
> --mca btl_base_verbose 30
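>
> For example, combining the forced transport with the verbose flag (the
> executable name here is just a placeholder):
>
> mpirun --mca btl openib,self,vader --mca btl_base_verbose 30 -np 2 ./my_app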
>
> See these FAQ entries:
> https://www.open-mpi.org/faq/?category=openfabrics#ib-btl
> https://www.open-mpi.org/faq/?category=all#tcp-routability-1.3
>
> Better to ask for more details on the Open MPI list, though. They are the pros!
>
> My two cents,
> Gus Correa
>
>
>
> On Tue, Apr 30, 2019 at 3:57 PM Faraz Hussain <info at feacluster.com> wrote:
>
>> Thanks, after building Open MPI 4 from source, it now works! However, it
>> still gives the message below when I run it with the verbose setting:
>>
>> No OpenFabrics connection schemes reported that they were able to be
>> used on a specific port. As such, the openib BTL (OpenFabrics
>> support) will be disabled for this port.
>>
>> Local host: lustwzb34
>> Local device: mlx4_0
>> Local port: 1
>> CPCs attempted: rdmacm, udcm
>>
>> However, the results from my latency and bandwidth tests seem to be
>> what I would expect from InfiniBand. See:
>>
>> [hussaif1 at lustwzb34 pt2pt]$ mpirun -v -np 2 -hostfile ./hostfile
>> ./osu_latency
>> # OSU MPI Latency Test v5.3.2
>> # Size          Latency (us)
>> 0                       1.87
>> 1                       1.88
>> 2                       1.93
>> 4                       1.92
>> 8                       1.93
>> 16                      1.95
>> 32                      1.93
>> 64                      2.08
>> 128                     2.61
>> 256                     2.72
>> 512                     2.93
>> 1024                    3.33
>> 2048                    3.81
>> 4096                    4.71
>> 8192                    6.68
>> 16384                   8.38
>> 32768                  12.13
>> 65536                  19.74
>> 131072                 35.08
>> 262144                 64.67
>> 524288                122.11
>> 1048576               236.69
>> 2097152               465.97
>> 4194304               926.31
>>
>> [hussaif1 at lustwzb34 pt2pt]$ mpirun -v -np 2 -hostfile ./hostfile
>> ./osu_bw
>> # OSU MPI Bandwidth Test v5.3.2
>> # Size      Bandwidth (MB/s)
>> 1                       3.09
>> 2                       6.35
>> 4                      12.77
>> 8                      26.01
>> 16                     51.31
>> 32                    103.08
>> 64                    197.89
>> 128                   362.00
>> 256                   676.28
>> 512                  1096.26
>> 1024                 1819.25
>> 2048                 2551.41
>> 4096                 3886.63
>> 8192                 3983.17
>> 16384                4362.30
>> 32768                4457.09
>> 65536                4502.41
>> 131072               4512.64
>> 262144               4531.48
>> 524288               4537.42
>> 1048576              4510.69
>> 2097152              4546.64
>> 4194304              4565.12
>>
>> When I run ibv_devinfo I get:
>>
>> [hussaif1 at lustwzb34 pt2pt]$ ibv_devinfo
>> hca_id: mlx4_0
>>         transport:          InfiniBand (0)
>>         fw_ver:             2.36.5000
>>         node_guid:          480f:cfff:fff5:c6c0
>>         sys_image_guid:     480f:cfff:fff5:c6c3
>>         vendor_id:          0x02c9
>>         vendor_part_id:     4103
>>         hw_ver:             0x0
>>         board_id:           HP_1360110017
>>         phys_port_cnt:      2
>>         Device ports:
>>             port:   1
>>                 state:          PORT_ACTIVE (4)
>>                 max_mtu:        4096 (5)
>>                 active_mtu:     1024 (3)
>>                 sm_lid:         0
>>                 port_lid:       0
>>                 port_lmc:       0x00
>>                 link_layer:     Ethernet
>>
>>             port:   2
>>                 state:          PORT_DOWN (1)
>>                 max_mtu:        4096 (5)
>>                 active_mtu:     1024 (3)
>>                 sm_lid:         0
>>                 port_lid:       0
>>                 port_lmc:       0x00
>>                 link_layer:     Ethernet
>>
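>> To double-check the link layer reported above, the sysfs attribute can
>> be read directly (device and port names taken from the output above):
>>
>> cat /sys/class/infiniband/mlx4_0/ports/1/link_layer
>>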
>> I will ask the Open MPI mailing list whether my results make sense.
>>
>>
>> Quoting Gus Correa <gus at ldeo.columbia.edu>:
>>
>> > Hi Faraz
>> >
>> > By all means, download the Open MPI tarball and build from source.
>> > Otherwise there won't be support for IB (the CentOS Open MPI packages
>> > most likely rely only on TCP/IP).
>> >
>> > Read their README file (it comes in the tarball), and take a careful
>> > look at their (excellent) FAQ:
>> > https://www.open-mpi.org/faq/
>> > Many issues can be solved by just reading these two resources.
>> >
>> > If you hit more trouble, subscribe to the Open MPI mailing list and
>> > ask questions there: you will get advice directly from the Open MPI
>> > developers, and the fix will come easily.
>> > https://www.open-mpi.org/community/lists/ompi.php
>> >
>> > My two cents,
>> > Gus Correa
>> >
>> > On Tue, Apr 30, 2019 at 3:07 PM Faraz Hussain <info at feacluster.com>
>> > wrote:
>> >
>> >> Thanks, yes I have installed those libraries; see the listings below.
>> >> Initially I installed the libraries via yum, but then I tried
>> >> installing the RPMs directly from the Mellanox website
>> >> (MLNX_OFED_LINUX-4.5-1.0.1.0-rhel7.5-x86_64.tar). Even after doing
>> >> that, I still got the same error with Open MPI. I will try your
>> >> suggestion of building Open MPI from source next!
>> >>
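>> >> For reference, a minimal build sketch (the version number, install
>> >> prefix, and job count are assumptions; --with-verbs enables the
>> >> InfiniBand verbs support that the packaged builds lacked):
>> >>
>> >> wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.1.tar.bz2
>> >> tar xjf openmpi-4.0.1.tar.bz2 && cd openmpi-4.0.1
>> >> ./configure --prefix=$HOME/sw/openmpi-4.0.1 --with-verbs
>> >> make -j 8 && make install
>> >> export PATH=$HOME/sw/openmpi-4.0.1/bin:$PATH
>> >>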
>> >> root at lustwzb34:/root # yum list | grep ibverbs
>> >> libibverbs.x86_64                41mlnx1-OFED.4.5.0.1.0.45101
>> >> libibverbs-devel.x86_64          41mlnx1-OFED.4.5.0.1.0.45101
>> >> libibverbs-devel-static.x86_64   41mlnx1-OFED.4.5.0.1.0.45101
>> >> libibverbs-utils.x86_64          41mlnx1-OFED.4.5.0.1.0.45101
>> >> libibverbs.i686                  17.2-3.el7     rhel-7-server-rpms
>> >> libibverbs-devel.i686            1.2.1-1.el7    rhel-7-server-rpms
>> >>
>> >> root at lustwzb34:/root # lsmod | grep ib
>> >> ib_ucm                 22602  0
>> >> ib_ipoib              168425  0
>> >> ib_cm                  53141  3  rdma_cm,ib_ucm,ib_ipoib
>> >> ib_umad                22093  0
>> >> mlx5_ib               339961  0
>> >> ib_uverbs             121821  3  mlx5_ib,ib_ucm,rdma_ucm
>> >> mlx5_core             919178  2  mlx5_ib,mlx5_fpga_tools
>> >> mlx4_ib               211747  0
>> >> ib_core               294554 10  rdma_cm,ib_cm,iw_cm,mlx4_ib,mlx5_ib,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib
>> >> mlx4_core             360598  2  mlx4_en,mlx4_ib
>> >> mlx_compat             29012 15  rdma_cm,ib_cm,iw_cm,mlx4_en,mlx4_ib,mlx5_ib,mlx5_fpga_tools,ib_ucm,ib_core,ib_umad,ib_uverbs,mlx4_core,mlx5_core,rdma_ucm,ib_ipoib
>> >> devlink                42368  4  mlx4_en,mlx4_ib,mlx4_core,mlx5_core
>> >> libcrc32c              12644  3  xfs,nf_nat,nf_conntrack
>> >> root at lustwzb34:/root #
>> >>
>> >>
>> >>
>> >> > Did you install libibverbs (and libibverbs-utils, for information
>> >> > and troubleshooting)?
>> >>
>> >> > yum list |grep ibverbs
>> >>
>> >> > Are you loading the ib modules?
>> >>
>> >> > lsmod |grep ib
>> >>
>> >>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf