<div dir="ltr"><div>Errr... are you not running a subnet manager?</div><div>Do you have an InfiniBand switch, or are you connecting two servers back-to-back?</div><div><br></div><div>Also, have you considered using OpenHPC rather than installing CentOS on two servers?</div><div>When you expand, this manual installation is going to be painful.</div></div><br><div class="gmail_quote"><div class="gmail_attr" dir="ltr">On Wed, 1 May 2019 at 15:05, Faraz Hussain <<a href="mailto:info@feacluster.com">info@feacluster.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid">> What hardware and what InfiniBand switch do you have?<br>
> Run   these commands:      ibdiagnet   smshow<br>
<br>
Unfortunately ibdiagnet seems to give some errors:<br>
<br>
[hussaif1@lustwzb34 ~]$ ibdiagnet<br>
----------<br>
Load Plugins from:<br>
/usr/share/ibdiagnet2.1.1/plugins/<br>
(You can specify more paths to be looked in with  <br>
"IBDIAGNET_PLUGINS_PATH" env variable)<br>
<br>
Plugin Name                                   Result     Comment<br>
libibdiagnet_cable_diag_plugin-2.1.1          Succeeded  Plugin loaded<br>
libibdiagnet_phy_diag_plugin-2.1.1            Succeeded  Plugin loaded<br>
<br>
---------------------------------------------<br>
Discovery<br>
-E- Failed to initialize<br>
<br>
-E- Fabric Discover failed, err=IBDiag initialize wasn't done<br>
-E- Fabric Discover failed, MAD err=Failed to umad_open_port<br>
<br>
---------------------------------------------<br>
Summary<br>
-I- Stage                     Warnings   Errors     Comment<br>
-I- Discovery                                       NA<br>
-I- Lids Check                                      NA<br>
-I- Links Check                                     NA<br>
-I- Subnet Manager                                  NA<br>
-I- Port Counters                                   NA<br>
-I- Nodes Information                               NA<br>
-I- Speed / Width checks                            NA<br>
-I- Partition Keys                                  NA<br>
-I- Alias GUIDs                                     NA<br>
-I- Temperature Sensing                             NA<br>
<br>
-I- You can find detailed errors/warnings in:  <br>
/var/tmp/ibdiagnet2/ibdiagnet2.log<br>
<br>
-E- A fatal error occurred, exiting...<br>
<br>
<br>
I do not have an smshow command, but I see there is an sminfo. It also<br>
gives an error:<br>
<br>
[hussaif1@lustwzb34 ~]$ smshow<br>
bash: smshow: command not found...<br>
[hussaif1@lustwzb34 ~]$ sm<br>
smartctl     smbcacls     smbcquotas   smbspool     smbtree       <br>
sm-notify    smpdump      smtp-sink<br>
smartd       smbclient    smbget       smbtar       sminfo        <br>
smparquery   smpquery     smtp-source<br>
[hussaif1@lustwzb34 ~]$ sminfo<br>
ibwarn: [10407] mad_rpc_open_port: can't open UMAD port ((null):0)<br>
sminfo: iberror: failed: Failed to open '(null)' port '0'<br>
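<br>
One thing worth checking alongside the subnet manager question: the ibv_devinfo output quoted further down reports link_layer: Ethernet on both ports, with sm_lid and port_lid at 0. sminfo and ibdiagnet open InfiniBand ports through umad, so one plausible reading is that they find no InfiniBand port to open. Below is a minimal sketch that lists each port's link layer from ibv_devinfo-style text on stdin. The function name list_link_layers is made up for illustration, and the parsing assumes the field layout shown in this thread:<br>

```shell
# Hypothetical helper: print "device port link_layer" for each port
# found in ibv_devinfo-style text read from stdin.
list_link_layers() {
    awk '
        $1 == "hca_id:"     { dev = $2 }           # remember current device
        $1 == "port:"       { port = $2 }          # remember current port
        $1 == "link_layer:" { print dev, port, $2 } # one line per port
    '
}

# Typical use on a node (runs only if ibv_devinfo is installed):
if command -v ibv_devinfo >/dev/null 2>&1; then
    ibv_devinfo | list_link_layers
fi
```

If every port prints Ethernet, the adapter is running in Ethernet mode and there is no InfiniBand subnet for a subnet manager to manage; this is only a hint, not a full diagnosis.<br>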
<br>
<br>
<br>
> You originally had the OpenMPI which was provided by CentOS  ??<br>
<br>
Correct.<br>
<br>
> You compiled the OpenMPI from source??<br>
<br>
Yes, I then compiled it from source and it seems to work (at least it<br>
gives reasonable numbers when running latency and bandwidth tests).<br>
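<br>
As a rough sanity check on "reasonable numbers": the peak osu_bw figure quoted further down is 4565.12 MB/s. Here is a small sketch converting that to Gbit/s. It assumes the benchmark counts 1 MB as 10^6 bytes, which is worth confirming against the OSU source, and mbs_to_gbits is a made-up helper name:<br>

```shell
# Hypothetical helper: convert MB/s (assumed 10^6 bytes per MB) to Gbit/s.
mbs_to_gbits() {
    awk -v mbs="$1" 'BEGIN { printf "%.2f\n", mbs * 8 / 1000 }'
}

mbs_to_gbits 4565.12   # prints 36.52
```

About 36.5 Gbit/s is well above what a 10 Gbit/s link can carry, but the figure alone does not say which transport Open MPI selected.<br>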
<br>
> How are you bringing the new OpenMPI version into your PATH? Are you<br>
> using modules or an MPI switcher utility?<br>
<br>
Just as follows:<br>
<br>
export PATH=/Apps/users/hussaif1/openmpi-4.0.0/bin:$PATH<br>
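<br>
A slightly fuller sketch of the same setup, in case it helps reproduce it. Setting LD_LIBRARY_PATH is an assumption (a source build normally installs its libraries under PREFIX/lib), and the variable name OMPI_PREFIX is only for illustration:<br>

```shell
# OMPI_PREFIX is an illustrative name; the path is the one used above.
OMPI_PREFIX=/Apps/users/hussaif1/openmpi-4.0.0
export PATH="$OMPI_PREFIX/bin:$PATH"

# Assumption: the build installed its shared libraries under $OMPI_PREFIX/lib.
# The ${VAR:+...} form avoids a trailing colon when the variable was empty.
export LD_LIBRARY_PATH="$OMPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Show which mpirun the shell will now pick up, if any.
command -v mpirun || echo "mpirun not found on PATH"
```

The same environment generally has to be visible on the second node as well, since mpirun starts processes there; mpirun's --prefix option is another way to achieve that.<br>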
<br>
Thanks!<br>
<br>
><br>
> On Wed, 1 May 2019 at 09:39, Benson Muite <<a href="mailto:benson_muite@emailplus.org" target="_blank">benson_muite@emailplus.org</a>><br>
> wrote:<br>
><br>
>> Hi Faraz,<br>
>><br>
>> Have you tried any other MPI distributions (eg. MPICH, MVAPICH)?<br>
>><br>
>> Regards,<br>
>><br>
>> Benson<br>
>> On 4/30/19 11:20 PM, Gus Correa wrote:<br>
>><br>
>> It may be using IPoIB (TCP/IP over IB), not verbs/rdma.<br>
>> You can force it to use openib (verbs, rdma) with (vader is for in-node<br>
>> shared memory):<br>
>><br>
>> mpirun --mca btl openib,self,vader ...<br>
>><br>
>><br>
>> This flag may also help tell which btl (byte transfer layer) is<br>
>> being used:<br>
>><br>
>>  --mca btl_base_verbose 30<br>
>><br>
>> See these FAQ entries:<br>
>> <a href="https://www.open-mpi.org/faq/?category=openfabrics#ib-btl" target="_blank" rel="noreferrer">https://www.open-mpi.org/faq/?category=openfabrics#ib-btl</a><br>
>> <a href="https://www.open-mpi.org/faq/?category=all#tcp-routability-1.3" target="_blank" rel="noreferrer">https://www.open-mpi.org/faq/?category=all#tcp-routability-1.3</a><br>
>><br>
>> Better to ask for more details on the Open MPI list. They are the pros!<br>
>><br>
>> My two cents,<br>
>> Gus Correa<br>
>><br>
>><br>
>><br>
>> On Tue, Apr 30, 2019 at 3:57 PM Faraz Hussain <<a href="mailto:info@feacluster.com" target="_blank">info@feacluster.com</a>> wrote:<br>
>><br>
>>> Thanks, after building openmpi 4 from source, it now works! However, it<br>
>>> still gives the message below when I run openmpi with the verbose setting:<br>
>>><br>
>>> No OpenFabrics connection schemes reported that they were able to be<br>
>>> used on a specific port.  As such, the openib BTL (OpenFabrics<br>
>>> support) will be disabled for this port.<br>
>>><br>
>>>    Local host:           lustwzb34<br>
>>>    Local device:         mlx4_0<br>
>>>    Local port:           1<br>
>>>    CPCs attempted:       rdmacm, udcm<br>
>>><br>
>>> However, the results from my latency and bandwidth tests seem to be<br>
>>> what I would expect from InfiniBand. See:<br>
>>><br>
>>> [hussaif1@lustwzb34 pt2pt]$ mpirun -v -np 2 -hostfile ./hostfile ./osu_latency<br>
>>> # OSU MPI Latency Test v5.3.2<br>
>>> # Size          Latency (us)<br>
>>> 0                       1.87<br>
>>> 1                       1.88<br>
>>> 2                       1.93<br>
>>> 4                       1.92<br>
>>> 8                       1.93<br>
>>> 16                      1.95<br>
>>> 32                      1.93<br>
>>> 64                      2.08<br>
>>> 128                     2.61<br>
>>> 256                     2.72<br>
>>> 512                     2.93<br>
>>> 1024                    3.33<br>
>>> 2048                    3.81<br>
>>> 4096                    4.71<br>
>>> 8192                    6.68<br>
>>> 16384                   8.38<br>
>>> 32768                  12.13<br>
>>> 65536                  19.74<br>
>>> 131072                 35.08<br>
>>> 262144                 64.67<br>
>>> 524288                122.11<br>
>>> 1048576               236.69<br>
>>> 2097152               465.97<br>
>>> 4194304               926.31<br>
>>><br>
>>> [hussaif1@lustwzb34 pt2pt]$ mpirun -v -np 2 -hostfile ./hostfile ./osu_bw<br>
>>> # OSU MPI Bandwidth Test v5.3.2<br>
>>> # Size      Bandwidth (MB/s)<br>
>>> 1                       3.09<br>
>>> 2                       6.35<br>
>>> 4                      12.77<br>
>>> 8                      26.01<br>
>>> 16                     51.31<br>
>>> 32                    103.08<br>
>>> 64                    197.89<br>
>>> 128                   362.00<br>
>>> 256                   676.28<br>
>>> 512                  1096.26<br>
>>> 1024                 1819.25<br>
>>> 2048                 2551.41<br>
>>> 4096                 3886.63<br>
>>> 8192                 3983.17<br>
>>> 16384                4362.30<br>
>>> 32768                4457.09<br>
>>> 65536                4502.41<br>
>>> 131072               4512.64<br>
>>> 262144               4531.48<br>
>>> 524288               4537.42<br>
>>> 1048576              4510.69<br>
>>> 2097152              4546.64<br>
>>> 4194304              4565.12<br>
>>><br>
>>> When I run ibv_devinfo I get:<br>
>>><br>
>>> [hussaif1@lustwzb34 pt2pt]$ ibv_devinfo<br>
>>> hca_id: mlx4_0<br>
>>>          transport:                      InfiniBand (0)<br>
>>>          fw_ver:                         2.36.5000<br>
>>>          node_guid:                      480f:cfff:fff5:c6c0<br>
>>>          sys_image_guid:                 480f:cfff:fff5:c6c3<br>
>>>          vendor_id:                      0x02c9<br>
>>>          vendor_part_id:                 4103<br>
>>>          hw_ver:                         0x0<br>
>>>          board_id:                       HP_1360110017<br>
>>>          phys_port_cnt:                  2<br>
>>>          Device ports:<br>
>>>                  port:   1<br>
>>>                          state:                  PORT_ACTIVE (4)<br>
>>>                          max_mtu:                4096 (5)<br>
>>>                          active_mtu:             1024 (3)<br>
>>>                          sm_lid:                 0<br>
>>>                          port_lid:               0<br>
>>>                          port_lmc:               0x00<br>
>>>                          link_layer:             Ethernet<br>
>>><br>
>>>                  port:   2<br>
>>>                          state:                  PORT_DOWN (1)<br>
>>>                          max_mtu:                4096 (5)<br>
>>>                          active_mtu:             1024 (3)<br>
>>>                          sm_lid:                 0<br>
>>>                          port_lid:               0<br>
>>>                          port_lmc:               0x00<br>
>>>                          link_layer:             Ethernet<br>
>>><br>
>>> I will ask the openmpi mailing list if my results make sense?!<br>
>>><br>
>>><br>
>>> Quoting Gus Correa <<a href="mailto:gus@ldeo.columbia.edu" target="_blank">gus@ldeo.columbia.edu</a>>:<br>
>>><br>
>>> > Hi Faraz<br>
>>> ><br>
>>> > By all means, download the Open MPI tarball and build from source.<br>
>>> > Otherwise there won't be support for IB (the CentOS Open MPI packages<br>
>>> > most likely rely only on TCP/IP).<br>
>>> ><br>
>>> > Read their README file (it comes in the tarball), and take a careful<br>
>>> > look at their (excellent) FAQ:<br>
>>> > <a href="https://www.open-mpi.org/faq/" target="_blank" rel="noreferrer">https://www.open-mpi.org/faq/</a><br>
>>> > Many issues can be solved by just reading these two resources.<br>
>>> ><br>
>>> > If you hit more trouble, subscribe to the Open MPI mailing list and ask<br>
>>> > questions there, because you will get advice directly from the Open MPI<br>
>>> > developers, and the fix will come easily.<br>
>>> > <a href="https://www.open-mpi.org/community/lists/ompi.php" target="_blank" rel="noreferrer">https://www.open-mpi.org/community/lists/ompi.php</a><br>
>>> ><br>
>>> > My two cents,<br>
>>> > Gus Correa<br>
>>> ><br>
>>> > On Tue, Apr 30, 2019 at 3:07 PM Faraz Hussain <<a href="mailto:info@feacluster.com" target="_blank">info@feacluster.com</a>> wrote:<br>
>>> ><br>
>>> >> Thanks, yes I have installed those libraries. See below. Initially I<br>
>>> >> installed the libraries via yum. But then I tried installing the rpms<br>
>>> >> directly from the Mellanox website<br>
>>> >> (MLNX_OFED_LINUX-4.5-1.0.1.0-rhel7.5-x86_64.tar). Even after doing<br>
>>> >> that, I still got the same error with openmpi. I will try your<br>
>>> >> suggestion of building openmpi from source next!<br>
>>> >><br>
>>> >> root@lustwzb34:/root # yum list | grep ibverbs<br>
>>> >> libibverbs.x86_64                     41mlnx1-OFED.4.5.0.1.0.45101<br>
>>> >> libibverbs-devel.x86_64               41mlnx1-OFED.4.5.0.1.0.45101<br>
>>> >> libibverbs-devel-static.x86_64        41mlnx1-OFED.4.5.0.1.0.45101<br>
>>> >> libibverbs-utils.x86_64               41mlnx1-OFED.4.5.0.1.0.45101<br>
>>> >> libibverbs.i686                       17.2-3.el7                   rhel-7-server-rpms<br>
>>> >> libibverbs-devel.i686                 1.2.1-1.el7                  rhel-7-server-rpms<br>
>>> >><br>
>>> >> root@lustwzb34:/root # lsmod | grep ib<br>
>>> >> ib_ucm                 22602  0<br>
>>> >> ib_ipoib              168425  0<br>
>>> >> ib_cm                  53141  3 rdma_cm,ib_ucm,ib_ipoib<br>
>>> >> ib_umad                22093  0<br>
>>> >> mlx5_ib               339961  0<br>
>>> >> ib_uverbs             121821  3 mlx5_ib,ib_ucm,rdma_ucm<br>
>>> >> mlx5_core             919178  2 mlx5_ib,mlx5_fpga_tools<br>
>>> >> mlx4_ib               211747  0<br>
>>> >> ib_core               294554  10 rdma_cm,ib_cm,iw_cm,mlx4_ib,mlx5_ib,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib<br>
>>> >> mlx4_core             360598  2 mlx4_en,mlx4_ib<br>
>>> >> mlx_compat             29012  15 rdma_cm,ib_cm,iw_cm,mlx4_en,mlx4_ib,mlx5_ib,mlx5_fpga_tools,ib_ucm,ib_core,ib_umad,ib_uverbs,mlx4_core,mlx5_core,rdma_ucm,ib_ipoib<br>
>>> >> devlink                42368  4 mlx4_en,mlx4_ib,mlx4_core,mlx5_core<br>
>>> >> libcrc32c              12644  3 xfs,nf_nat,nf_conntrack<br>
>>> >> root@lustwzb34:/root #<br>
>>> >><br>
>>> >><br>
>>> >><br>
>>> >> > Did you install libibverbs (and libibverbs-utils, for information and<br>
>>> >> > troubleshooting)?<br>
>>> >><br>
>>> >> > yum list |grep ibverbs<br>
>>> >><br>
>>> >> > Are you loading the ib modules?<br>
>>> >><br>
>>> >> > lsmod |grep ib<br>
>>> >><br>
>>> >><br>
>>><br>
>>><br>
>>><br>
>>><br>
>> _______________________________________________<br>
>> Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org" target="_blank">Beowulf@beowulf.org</a> sponsored by Penguin Computing<br>
>> To change your subscription (digest mode or unsubscribe) visit  <br>
>> <a href="https://beowulf.org/cgi-bin/mailman/listinfo/beowulf" target="_blank" rel="noreferrer">https://beowulf.org/cgi-bin/mailman/listinfo/beowulf</a><br>
>><br>
<br>
<br>
<br>
</blockquote></div>