[Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?
John Hearns
hearnsj at googlemail.com
Wed May 1 07:31:27 PDT 2019
Errrr.. you are not running a subnet manager?
Do you have an InfiniBand switch, or are you connecting two servers
back-to-back?
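If it is back-to-back or an unmanaged switch, the IB ports will not come up
until a subnet manager is running somewhere on the fabric. A rough sketch,
assuming the stock opensm package (the service name may differ with Mellanox
OFED, which ships opensmd):

    yum install opensm
    systemctl enable --now opensm
    sminfo    # should now report a master SM
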
Also - have you considered using OpenHPC rather than installing CentOS on
two servers? When you expand, this manual installation is going to be painful.
On Wed, 1 May 2019 at 15:05, Faraz Hussain <info at feacluster.com> wrote:
> > What hardware and what InfiniBand switch do you have?
> > Run these commands: ibdiagnet smshow
>
> Unfortunately ibdiagnet seems to give some errors:
>
> [hussaif1 at lustwzb34 ~]$ ibdiagnet
> ----------
> Load Plugins from:
> /usr/share/ibdiagnet2.1.1/plugins/
> (You can specify more paths to be looked in with
> "IBDIAGNET_PLUGINS_PATH" env variable)
>
> Plugin Name Result Comment
> libibdiagnet_cable_diag_plugin-2.1.1 Succeeded Plugin loaded
> libibdiagnet_phy_diag_plugin-2.1.1 Succeeded Plugin loaded
>
> ---------------------------------------------
> Discovery
> -E- Failed to initialize
>
> -E- Fabric Discover failed, err=IBDiag initialize wasn't done
> -E- Fabric Discover failed, MAD err=Failed to umad_open_port
>
> ---------------------------------------------
> Summary
> -I- Stage Warnings Errors Comment
> -I- Discovery NA
> -I- Lids Check NA
> -I- Links Check NA
> -I- Subnet Manager NA
> -I- Port Counters NA
> -I- Nodes Information NA
> -I- Speed / Width checks NA
> -I- Partition Keys NA
> -I- Alias GUIDs NA
> -I- Temperature Sensing NA
>
> -I- You can find detailed errors/warnings in:
> /var/tmp/ibdiagnet2/ibdiagnet2.log
>
> -E- A fatal error occurred, exiting...
>
>
> I do not have the smshow command, but I see there is an sminfo. It also
> gives this error:
>
> [hussaif1 at lustwzb34 ~]$ smshow
> bash: smshow: command not found...
> [hussaif1 at lustwzb34 ~]$ sm
> smartctl smbcacls smbcquotas smbspool smbtree
> sm-notify smpdump smtp-sink
> smartd smbclient smbget smbtar sminfo
> smparquery smpquery smtp-source
> [hussaif1 at lustwzb34 ~]$ sminfo
> ibwarn: [10407] mad_rpc_open_port: can't open UMAD port ((null):0)
> sminfo: iberror: failed: Failed to open '(null)' port '0'
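>
> (The umad_open_port failures make me wonder whether these diagnostics need
> to run as root, or at least need access to the /dev/infiniband/umad*
> devices, and perhaps an explicit device and port. I have not tried it yet,
> but presumably something like:
>
> sudo ibdiagnet
> sminfo -C mlx4_0 -P 1
>
> where mlx4_0 and port 1 are what ibv_devinfo reports on this node.)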
>
>
>
> > You originally had the OpenMPI which was provided by CentOS ??
>
> Correct.
>
> > You compiled the OpenMPI from source??
>
> Yes, I then compiled it from source and it seems to work (at least it
> gives reasonable numbers when running the latency and bandwidth tests).
>
> > How are you bringing the new OpenMPI version into your PATH? Are you
> > using modules or an mpi switcher utility?
>
> Just as follows:
>
> export PATH=/Apps/users/hussaif1/openmpi-4.0.0/bin:$PATH
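>
> (Should I also be exporting the matching library directory, something like
>
> export LD_LIBRARY_PATH=/Apps/users/hussaif1/openmpi-4.0.0/lib:$LD_LIBRARY_PATH
>
> so that mpirun does not pick up the system libmpi? I have not set that so far.)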
>
> Thanks!
>
> >
> > On Wed, 1 May 2019 at 09:39, Benson Muite <benson_muite at emailplus.org>
> > wrote:
> >
> >> Hi Faraz,
> >>
> >> Have you tried any other MPI distributions (e.g. MPICH, MVAPICH)?
> >>
> >> Regards,
> >>
> >> Benson
> >> On 4/30/19 11:20 PM, Gus Correa wrote:
> >>
> >> It may be using IPoIB (TCP/IP over IB), not verbs/rdma.
> >> You can force it to use openib (verbs, rdma) with the command below (vader
> >> is for in-node shared memory):
> >>
> >> mpirun --mca btl openib,self,vader ...
> >>
> >>
> >> These flags may also help tell which btl (byte transport layer) is
> >> being used:
> >>
> >> --mca btl_base_verbose 30
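> >>
> >> For example, something along these lines, reusing the hostfile and OSU
> >> test from earlier in the thread:
> >>
> >> mpirun --mca btl openib,self,vader --mca btl_base_verbose 30 \
> >>        -np 2 -hostfile ./hostfile ./osu_latency
> >>
> >> and then check the verbose output for which btl was actually selected.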
> >>
> >> See these FAQ:
> >> https://www.open-mpi.org/faq/?category=openfabrics#ib-btl
> >> https://www.open-mpi.org/faq/?category=all#tcp-routability-1.3
> >>
> >> Better to really ask for more details on the Open MPI list. They are the pros!
> >>
> >> My two cents,
> >> Gus Correa
> >>
> >>
> >>
> >> On Tue, Apr 30, 2019 at 3:57 PM Faraz Hussain <info at feacluster.com>
> wrote:
> >>
> >>> Thanks, after building openmpi 4 from source, it now works! However, it
> >>> still gives the message below when I run openmpi with the verbose setting:
> >>>
> >>> No OpenFabrics connection schemes reported that they were able to be
> >>> used on a specific port. As such, the openib BTL (OpenFabrics
> >>> support) will be disabled for this port.
> >>>
> >>> Local host: lustwzb34
> >>> Local device: mlx4_0
> >>> Local port: 1
> >>> CPCs attempted: rdmacm, udcm
> >>>
> >>> However, the results from my latency and bandwidth tests seem to be
> >>> what I would expect from InfiniBand. See:
> >>>
> >>> [hussaif1 at lustwzb34 pt2pt]$ mpirun -v -np 2 -hostfile ./hostfile
> >>> ./osu_latency
> >>> # OSU MPI Latency Test v5.3.2
> >>> # Size Latency (us)
> >>> 0 1.87
> >>> 1 1.88
> >>> 2 1.93
> >>> 4 1.92
> >>> 8 1.93
> >>> 16 1.95
> >>> 32 1.93
> >>> 64 2.08
> >>> 128 2.61
> >>> 256 2.72
> >>> 512 2.93
> >>> 1024 3.33
> >>> 2048 3.81
> >>> 4096 4.71
> >>> 8192 6.68
> >>> 16384 8.38
> >>> 32768 12.13
> >>> 65536 19.74
> >>> 131072 35.08
> >>> 262144 64.67
> >>> 524288 122.11
> >>> 1048576 236.69
> >>> 2097152 465.97
> >>> 4194304 926.31
> >>>
> >>> [hussaif1 at lustwzb34 pt2pt]$ mpirun -v -np 2 -hostfile ./hostfile
> >>> ./osu_bw
> >>> # OSU MPI Bandwidth Test v5.3.2
> >>> # Size Bandwidth (MB/s)
> >>> 1 3.09
> >>> 2 6.35
> >>> 4 12.77
> >>> 8 26.01
> >>> 16 51.31
> >>> 32 103.08
> >>> 64 197.89
> >>> 128 362.00
> >>> 256 676.28
> >>> 512 1096.26
> >>> 1024 1819.25
> >>> 2048 2551.41
> >>> 4096 3886.63
> >>> 8192 3983.17
> >>> 16384 4362.30
> >>> 32768 4457.09
> >>> 65536 4502.41
> >>> 131072 4512.64
> >>> 262144 4531.48
> >>> 524288 4537.42
> >>> 1048576 4510.69
> >>> 2097152 4546.64
> >>> 4194304 4565.12
> >>>
> >>> When I run ibv_devinfo I get:
> >>>
> >>> [hussaif1 at lustwzb34 pt2pt]$ ibv_devinfo
> >>> hca_id: mlx4_0
> >>> transport: InfiniBand (0)
> >>> fw_ver: 2.36.5000
> >>> node_guid: 480f:cfff:fff5:c6c0
> >>> sys_image_guid: 480f:cfff:fff5:c6c3
> >>> vendor_id: 0x02c9
> >>> vendor_part_id: 4103
> >>> hw_ver: 0x0
> >>> board_id: HP_1360110017
> >>> phys_port_cnt: 2
> >>> Device ports:
> >>> port: 1
> >>> state: PORT_ACTIVE (4)
> >>> max_mtu: 4096 (5)
> >>> active_mtu: 1024 (3)
> >>> sm_lid: 0
> >>> port_lid: 0
> >>> port_lmc: 0x00
> >>> link_layer: Ethernet
> >>>
> >>> port: 2
> >>> state: PORT_DOWN (1)
> >>> max_mtu: 4096 (5)
> >>> active_mtu: 1024 (3)
> >>> sm_lid: 0
> >>> port_lid: 0
> >>> port_lmc: 0x00
> >>> link_layer: Ethernet
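> >>>
> >>> (Both ports report link_layer: Ethernet, so I assume these are RoCE ports
> >>> rather than plain InfiniBand, which might also explain why the IB-only
> >>> tools complain. If I am reading the Open MPI FAQ correctly, RoCE wants the
> >>> rdmacm connection manager, so perhaps something like:
> >>>
> >>> mpirun --mca btl openib,self,vader --mca btl_openib_cpc_include rdmacm \
> >>>        -np 2 -hostfile ./hostfile ./osu_latency
> >>>
> >>> but I may be wrong about that.)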
> >>>
> >>> I will ask the openmpi mailing list if my results make sense?!
> >>>
> >>>
> >>> Quoting Gus Correa <gus at ldeo.columbia.edu>:
> >>>
> >>> > Hi Faraz
> >>> >
> >>> > By all means, download the Open MPI tarball and build from source.
> >>> > Otherwise there won't be support for IB (the CentOS Open MPI packages most
> >>> > likely rely only on TCP/IP).
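> >>> >
> >>> > A minimal sketch of what I mean (adjust the prefix and version to taste):
> >>> >
> >>> > ./configure --prefix=$HOME/openmpi-4.0.0 --with-verbs
> >>> > make -j all && make install
> >>> >
> >>> > and then put its bin directory first in your PATH.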
> >>> >
> >>> > Read their README file (it comes in the tarball), and take a careful look
> >>> > at their (excellent) FAQ:
> >>> > https://www.open-mpi.org/faq/
> >>> > Many issues can be solved by just reading these two resources.
> >>> >
> >>> > If you hit more trouble, subscribe to the Open MPI mailing list, and ask
> >>> > questions there, because you will get advice directly from the Open MPI
> >>> > developers, and the fix will come easy.
> >>> > https://www.open-mpi.org/community/lists/ompi.php
> >>> >
> >>> > My two cents,
> >>> > Gus Correa
> >>> >
> >>> > On Tue, Apr 30, 2019 at 3:07 PM Faraz Hussain <info at feacluster.com>
> >>> wrote:
> >>> >
> >>> >> Thanks, yes I have installed those libraries. See below. Initially I
> >>> >> installed the libraries via yum. But then I tried installing the rpms
> >>> >> directly from the Mellanox website
> >>> >> (MLNX_OFED_LINUX-4.5-1.0.1.0-rhel7.5-x86_64.tar). Even after doing
> >>> >> that, I still got the same error with openmpi. I will try your
> >>> >> suggestion of building openmpi from source next!
> >>> >>
> >>> >> root at lustwzb34:/root # yum list | grep ibverbs
> >>> >> libibverbs.x86_64 41mlnx1-OFED.4.5.0.1.0.45101
> >>> >> libibverbs-devel.x86_64 41mlnx1-OFED.4.5.0.1.0.45101
> >>> >> libibverbs-devel-static.x86_64 41mlnx1-OFED.4.5.0.1.0.45101
> >>> >> libibverbs-utils.x86_64 41mlnx1-OFED.4.5.0.1.0.45101
> >>> >> libibverbs.i686 17.2-3.el7 rhel-7-server-rpms
> >>> >> libibverbs-devel.i686 1.2.1-1.el7 rhel-7-server-rpms
> >>> >>
> >>> >> root at lustwzb34:/root # lsmod | grep ib
> >>> >> ib_ucm 22602 0
> >>> >> ib_ipoib 168425 0
> >>> >> ib_cm 53141 3 rdma_cm,ib_ucm,ib_ipoib
> >>> >> ib_umad 22093 0
> >>> >> mlx5_ib 339961 0
> >>> >> ib_uverbs 121821 3 mlx5_ib,ib_ucm,rdma_ucm
> >>> >> mlx5_core 919178 2 mlx5_ib,mlx5_fpga_tools
> >>> >> mlx4_ib 211747 0
> >>> >> ib_core 294554 10 rdma_cm,ib_cm,iw_cm,mlx4_ib,mlx5_ib,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib
> >>> >> mlx4_core 360598 2 mlx4_en,mlx4_ib
> >>> >> mlx_compat 29012 15 rdma_cm,ib_cm,iw_cm,mlx4_en,mlx4_ib,mlx5_ib,mlx5_fpga_tools,ib_ucm,ib_core,ib_umad,ib_uverbs,mlx4_core,mlx5_core,rdma_ucm,ib_ipoib
> >>> >> devlink 42368 4 mlx4_en,mlx4_ib,mlx4_core,mlx5_core
> >>> >> libcrc32c 12644 3 xfs,nf_nat,nf_conntrack
> >>> >> root at lustwzb34:/root #
> >>> >>
> >>> >>
> >>> >>
> >>> >> > Did you install libibverbs (and libibverbs-utils, for information and
> >>> >> > troubleshooting)?
> >>> >>
> >>> >> > yum list |grep ibverbs
> >>> >>
> >>> >> > Are you loading the ib modules?
> >>> >>
> >>> >> > lsmod |grep ib
> >>> >>
> >>> >>
> >>>
> >>>
> >>>
> >>>
> >> _______________________________________________
> >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> >> To change your subscription (digest mode or unsubscribe) visit
> >> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
> >>
>
>
>
>