[Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

Faraz Hussain info at feacluster.com
Thu May 2 10:50:02 PDT 2019


Thanks John. I believe we purchased the enclosure from HPE with only
hardware support. I am not aware of any support contract with
Mellanox. We are running RHEL 7.5 (I may have accidentally said it
was CentOS, but that was a typo).

I am more the application guy. We have a hardware/networking
sysadmin. Lots of good information in your post that I'll discuss
with our sysadmin.

Quoting John Hearns <hearnsj at googlemail.com>:

> You ask some damned good questions there.
> I will try to answer them from the point of view of someone who has worked
> as an HPC systems integrator and supported HPC systems,
> both for systems integrators and within companies.
>
> We will start with HP. Did you buy those systems direct from HP as servers,
> or did you buy a configured HPC system,
> complete with Infiniband networking and with a software stack?
> If you bought bare metal servers then you are out of luck regarding
> support, other than hardware failures.
> HP now incorporate SGI, and their support is fantastic. Great people work
> for HP and SGI. But they aren't responsible for your install.
>
> If however you bought an integrated HPC system this will normally be
> integrated by a smaller company, usually in your country.
> Is this the case here?  Then yes the integrator should be providing support.
> HOWEVER you have elected to remove their installed OS and upgrade by
> yourself. If I was the integrator I would give advice,
> but refuse to support the upgrade unless it was recommended by us, and you
> have a continuing support contract.
>
> You are using CentOS. The CentOS team are great guys - I know the founder
> quite well, and know people who work for RedHat.
> You have chosen CentOS - Community Supported Operating System. Join the
> CentOS HPC SIG perhaps and ask for help.
> But you don't get support from Red Hat - as you are not using Red Hat
> Enterprise Linux.
>
> Now we come to Mellanox. Mellanox support is fantastic. Formally, to open a
> support ticket with them you will need a support agreement
> on your switch. You HAVE got a support agreement - right?
> If not I have found that informal requests for support are often answered
> by Mellanox support.
>
> Failing all of those you could hire me!
> (I am being semi-serious here - I am a permanent employee at the moment,
> but I have worked as an HPC contractor in the past,
> and if I could justify it I would prefer to do HPC support on a contract
> basis).
>
> On Thu, 2 May 2019 at 16:45, Faraz Hussain <info at feacluster.com> wrote:
>
>> Thanks. Before I go down the path of installing things willy-nilly, is
>> there some guide I should be following instead? I obviously have a
>> problem with my Mellanox drivers combined with "user error".
>>
>> So should I be paying Mellanox to help? Or is it a Red Hat issue? Or is
>> it our hardware vendor, HP, who should be involved?
>>
>> Looks like I need support on how to get support :-)
>>
>>
>> Quoting Christopher Samuel <chris at csamuel.org>:
>>
>> >> root@lustwzb34:/root # systemctl status rdma
>> >> Unit rdma.service could not be found.
>> >
>> > You're missing this RPM then, which might explain a lot:
>> >
>> > $ rpm -qi rdma-core
>> > Name        : rdma-core
>> > Version     : 17.2
>> > Release     : 3.el7
>> > Architecture: x86_64
>> > Install Date: Tue 04 Dec 2018 03:58:16 PM AEDT
>> > Group       : Unspecified
>> > Size        : 107924
>> > License     : GPLv2 or BSD
>> > Signature   : RSA/SHA256, Tue 13 Nov 2018 01:45:22 AM AEDT, Key ID 24c6a8a7f4a80eb5
>> > Source RPM  : rdma-core-17.2-3.el7.src.rpm
>> > Build Date  : Wed 31 Oct 2018 07:10:24 AM AEDT
>> > Build Host  : x86-01.bsys.centos.org
>> > Relocations : (not relocatable)
>> > Packager    : CentOS BuildSystem <http://bugs.centos.org>
>> > Vendor      : CentOS
>> > URL         : https://github.com/linux-rdma/rdma-core
>> > Summary     : RDMA core userspace libraries and daemons
>> > Description :
>> > RDMA core userspace infrastructure and documentation, including initscripts,
>> > kernel driver-specific modprobe override configs, IPoIB network scripts,
>> > dracut rules, and the rdma-ndd utility.
>> >
>> > --
>> >   Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
>> > _______________________________________________
>> > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> > To change your subscription (digest mode or unsubscribe) visit
>> > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>>




