[Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?
John Hearns
hearnsj at googlemail.com
Thu May 2 09:07:35 PDT 2019
Please tell us the history of the overall system.
Was it bought as hardware only from a supplier? Or was it delivered as an
already set up system with operating system, applications, Infiniband
drivers etc?
I would also look at Qlustar
https://www.qlustar.com/book/qlustar/summary
and Bright https://www.brightcomputing.com/
Bright will certainly give you excellent support.
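
As a starting point for the missing rdma.service that Chris spotted in the
quoted thread below, a rough sketch like this could confirm whether rdma-core
is installed before anything else gets changed. It assumes an RPM-based system
with systemd, such as CentOS 7. The function name check_rdma is made up for
illustration, and the yum hint is only a suggestion.

```shell
#!/bin/sh
# Sketch: report whether the rdma-core package and the rdma unit are present.
# Assumption: RPM-based distribution with systemd (e.g. CentOS 7).
# check_rdma is a hypothetical helper name, not part of any package.

check_rdma() {
    # Without rpm we cannot query the package database at all.
    if ! command -v rpm >/dev/null 2>&1; then
        echo "rpm not found: not an RPM-based system?"
        return 2
    fi

    # 'rpm -q' exits non-zero when the package is not installed.
    if rpm -q rdma-core >/dev/null 2>&1; then
        echo "rdma-core installed: $(rpm -q rdma-core)"
    else
        echo "rdma-core missing; on CentOS try: yum install rdma-core"
        return 1
    fi

    # 'systemctl status' exits non-zero if the unit is absent or inactive.
    if command -v systemctl >/dev/null 2>&1; then
        if systemctl status rdma >/dev/null 2>&1; then
            echo "rdma service active"
        else
            echo "rdma service not active or not found"
        fi
    fi
    return 0
}

check_rdma || echo "check_rdma exit status: $?"
```

Exit status 1 means the package is absent and 2 means rpm itself was not
found, so the check is safe to try on a single node first.
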
On Thu, 2 May 2019 at 17:02, John Hearns <hearnsj at googlemail.com> wrote:
> You ask some damned good questions there.
> I will try to answer them from the point of view of someone who has worked
> as an HPC systems integrator and supported HPC systems,
> both for systems integrators and within companies.
>
> We will start with HP. Did you buy those systems direct from HP as
> servers, or did you buy a configured HPC system,
> complete with Infiniband networking and with a software stack?
> If you bought bare metal servers then you are out of luck regarding
> support, other than hardware failures.
> HP now incorporate SGI, and their support is fantastic. Great people work
> for HP and SGI. But they aren't responsible for your install.
>
> If however you bought an integrated HPC system this will normally be
> integrated by a smaller company, usually in your country.
> Is this the case here? Then yes the integrator should be providing
> support.
> HOWEVER you have elected to remove their installed OS and upgrade by
> yourself. If I was the integrator I would give advice,
> but refuse to support the upgrade unless it was recommended by us, and you
> have a continuing support contract.
>
> You are using CentOS. The CentOS team are great guys - I know the founder
> quite well, and know people who work for Red Hat.
> You have chosen CentOS - the Community Enterprise Operating System, which is community supported. Join the
> CentOS HPC SIG perhaps and ask for help.
> But you don't get support from Red Hat - as you are not using Red Hat
> Enterprise Linux.
>
> Now we come to Mellanox. Mellanox support is fantastic. Formally, to open
> a support ticket with them you will need a support agreement
> on your switch. You HAVE got a support agreement - right?
> If not I have found that informal requests for support are often answered
> by Mellanox support.
>
> Failing all of those you could hire me!
> (I am being semi-serious here - I am a permanent employee at the moment,
> but I have worked as an HPC contractor in the past,
> and if I could justify it I would prefer to do HPC support on a contract
> basis).
>
> On Thu, 2 May 2019 at 16:45, Faraz Hussain <info at feacluster.com> wrote:
>
>> Thanks. Before I go down the path of installing things willy-nilly, is
>> there some guide I should be following instead? I obviously have a
>> problem with my Mellanox drivers combined with "user error".
>>
>> So should I be paying Mellanox to help? Or is it a Red Hat issue? Or is
>> it our hardware vendor, HP, who should be involved?
>>
>> Looks like I need support on how to get support :-)
>>
>>
>> Quoting Christopher Samuel <chris at csamuel.org>:
>>
>> >> root at lustwzb34:/root # systemctl status rdma
>> >> Unit rdma.service could not be found.
>> >
>> > You're missing this RPM then, which might explain a lot:
>> >
>> > $ rpm -qi rdma-core
>> > Name : rdma-core
>> > Version : 17.2
>> > Release : 3.el7
>> > Architecture: x86_64
>> > Install Date: Tue 04 Dec 2018 03:58:16 PM AEDT
>> > Group : Unspecified
>> > Size : 107924
>> > License : GPLv2 or BSD
>> > Signature : RSA/SHA256, Tue 13 Nov 2018 01:45:22 AM AEDT, Key ID 24c6a8a7f4a80eb5
>> > Source RPM : rdma-core-17.2-3.el7.src.rpm
>> > Build Date : Wed 31 Oct 2018 07:10:24 AM AEDT
>> > Build Host : x86-01.bsys.centos.org
>> > Relocations : (not relocatable)
>> > Packager : CentOS BuildSystem <http://bugs.centos.org>
>> > Vendor : CentOS
>> > URL : https://github.com/linux-rdma/rdma-core
>> > Summary : RDMA core userspace libraries and daemons
>> > Description :
>> > RDMA core userspace infrastructure and documentation, including initscripts,
>> > kernel driver-specific modprobe override configs, IPoIB network scripts,
>> > dracut rules, and the rdma-ndd utility.
>> >
>> > --
>> > Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
>> > _______________________________________________
>> > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> > To change your subscription (digest mode or unsubscribe) visit
>> > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>>
>