[Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

John Hearns hearnsj at googlemail.com
Thu May 2 09:02:38 PDT 2019


You ask some damned good questions there.
I will try to answer them from the point of view of someone who has worked
as an HPC systems integrator and supported HPC systems,
both for systems integrators and within companies.

We will start with HP. Did you buy those systems direct from HP as servers,
or did you buy a configured HPC system,
complete with Infiniband networking and with a software stack?
If you bought bare metal servers then you are out of luck regarding
support, other than hardware failures.
HP now incorporate SGI, and their support is fantastic. Great people work
for HP and SGI. But they aren't responsible for your install.

If however you bought an integrated HPC system this will normally be
integrated by a smaller company, usually in your country.
Is this the case here? If so then yes, the integrator should be providing
support.
HOWEVER you have elected to remove their installed OS and upgrade by
yourself. If I were the integrator I would give advice,
but refuse to support the upgrade unless we had recommended it and you
have a continuing support contract.

You are using CentOS. The CentOS team are great guys - I know the founder
quite well, and know people who work for Red Hat.
You have chosen CentOS - the Community Enterprise Operating System, which is
community supported. Join the CentOS HPC SIG perhaps and ask for help.
But you don't get support from Red Hat - as you are not using Red Hat
Enterprise Linux.

Now we come to Mellanox. Mellanox support is fantastic. Formally, to open a
support ticket with them you will need a support agreement
on your switch. You HAVE got a support agreement - right?
If not, I have found that Mellanox support will often answer informal
requests for help.

Failing all of those you could hire me!
(I am being semi-serious here - I am a permanent employee at the moment,
but I have worked as an HPC contractor in the past,
and if I could justify it I would prefer to do HPC support on a contract
basis).
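
Whoever ends up supporting you will want to know which RDMA packages are
actually on the nodes, given the missing rdma-core that Chris points out in
the quoted message below. Here is a minimal sketch of such a check, not a
definitive procedure. The function name check_rpms and the QUERY override are
my own inventions for illustration; only "rpm -q" and the rdma-core package
name come from this thread.

```shell
#!/bin/sh
# Report whether each named RPM is installed.
# Returns non-zero if any package is missing.
# QUERY is a hypothetical override (default: "rpm -q") so the function
# can be tried on a machine that has no rpm database.
check_rpms() {
    missing=0
    for pkg in "$@"; do
        # Left unquoted on purpose so "rpm -q" splits into command + flag.
        if ${QUERY:-rpm -q} "$pkg" >/dev/null 2>&1; then
            echo "ok: $pkg"
        else
            echo "MISSING: $pkg"
            missing=1
        fi
    done
    return "$missing"
}
```

On a compute node you would run "check_rpms rdma-core" and add whatever other
packages your stack depends on. If you are running the vendor OFED rather than
the distribution RDMA stack the package names will differ, so treat the list
as an assumption to adjust.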

On Thu, 2 May 2019 at 16:45, Faraz Hussain <info at feacluster.com> wrote:

> Thanks. Before I go down the path of installing things willy-nilly, is
> there some guide I should be following instead? I obviously have a
> problem with my Mellanox drivers combined with "user error".
>
> So should I be paying Mellanox to help? Or is it a Red Hat issue? Or is
> it our hardware vendor, HP, who should be involved?
>
> Looks like I need support on how to get support :-)
>
>
> Quoting Christopher Samuel <chris at csamuel.org>:
>
> >> root@lustwzb34:/root # systemctl status rdma
> >> Unit rdma.service could not be found.
> >
> > You're missing this RPM then, which might explain a lot:
> >
> > $ rpm -qi rdma-core
> > Name        : rdma-core
> > Version     : 17.2
> > Release     : 3.el7
> > Architecture: x86_64
> > Install Date: Tue 04 Dec 2018 03:58:16 PM AEDT
> > Group       : Unspecified
> > Size        : 107924
> > License     : GPLv2 or BSD
> > Signature   : RSA/SHA256, Tue 13 Nov 2018 01:45:22 AM AEDT, Key ID 24c6a8a7f4a80eb5
> > Source RPM  : rdma-core-17.2-3.el7.src.rpm
> > Build Date  : Wed 31 Oct 2018 07:10:24 AM AEDT
> > Build Host  : x86-01.bsys.centos.org
> > Relocations : (not relocatable)
> > Packager    : CentOS BuildSystem <http://bugs.centos.org>
> > Vendor      : CentOS
> > URL         : https://github.com/linux-rdma/rdma-core
> > Summary     : RDMA core userspace libraries and daemons
> > Description :
> > RDMA core userspace infrastructure and documentation, including initscripts,
> > kernel driver-specific modprobe override configs, IPoIB network scripts,
> > dracut rules, and the rdma-ndd utility.
> >
> > --
> >   Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> > To change your subscription (digest mode or unsubscribe) visit
> > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf