<div dir="ltr"><div>You ask some damned good questions there.</div><div>I will try to answer them from the point of view of someone who has worked as an HPC systems integrator and supported HPC systems,</div><div>both for systems integrators and within companies.</div><div><br></div><div>We will start with HP. Did you buy those systems direct from HP as servers, or did you buy a configured HPC system, </div><div>complete with InfiniBand networking and a software stack?</div><div>If you bought bare-metal servers then you are out of luck regarding support, other than for hardware failures.</div><div>HP now incorporate SGI, and their support is fantastic. Great people work for HP and SGI. But they aren't responsible for your install.</div><div><br></div><div>If, however, you bought an integrated HPC system, it will normally have been integrated by a smaller company, usually in your country.</div><div>Is this the case here? Then yes, the integrator should be providing support.</div><div>HOWEVER, you have elected to remove their installed OS and upgrade by yourself. If I were the integrator I would give advice,</div><div>but refuse to support the upgrade unless it was recommended by us and you had a continuing support contract.</div><div><br></div><div>You are using CentOS. The CentOS team are great guys - I know the founder quite well, and know people who work for Red Hat.</div><div>You have chosen CentOS, the Community Enterprise Operating System, which is community supported. Perhaps join the CentOS HPC SIG and ask for help.</div><div>But you don't get support from Red Hat, as you are not using Red Hat Enterprise Linux.</div><div><br></div><div>Now we come to Mellanox. Mellanox support is fantastic. Formally, to open a support ticket with them you will need a support agreement</div><div>on your switch. 
You HAVE got a support agreement - right?</div><div>If not, I have found that informal requests for support are often answered by Mellanox support.</div><div><br></div><div>Failing all of those, you could hire me!</div><div>(I am being semi-serious here - I am a permanent employee at the moment, but I have worked as an HPC contractor in the past,</div><div>and if I could justify it I would prefer to do HPC support on a contract basis).</div><div><br></div></div><br><div class="gmail_quote"><div class="gmail_attr" dir="ltr">On Thu, 2 May 2019 at 16:45, Faraz Hussain <<a href="mailto:info@feacluster.com">info@feacluster.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid">Thanks. Before I go down the path of installing things willy-nilly, is <br>
there some guide I should be following instead? I obviously have a <br>
problem with my Mellanox drivers combined with "user error"..<br>
<br>
So should I be paying Mellanox to help? Or is it a RedHat issue? Or is <br>
it our hardware vendor, HP, who should be involved??<br>
<br>
Looks like I need support on how to get support :-)<br>
<br>
<br>
Quoting Christopher Samuel <<a href="mailto:chris@csamuel.org" target="_blank">chris@csamuel.org</a>>:<br>
<br>
>> root@lustwzb34:/root # systemctl status rdma<br>
>> Unit rdma.service could not be found.<br>
><br>
> You're missing this RPM then, which might explain a lot:<br>
><br>
> $ rpm -qi rdma-core<br>
> Name : rdma-core<br>
> Version : 17.2<br>
> Release : 3.el7<br>
> Architecture: x86_64<br>
> Install Date: Tue 04 Dec 2018 03:58:16 PM AEDT<br>
> Group : Unspecified<br>
> Size : 107924<br>
> License : GPLv2 or BSD<br>
> Signature : RSA/SHA256, Tue 13 Nov 2018 01:45:22 AM AEDT, Key ID <br>
> 24c6a8a7f4a80eb5<br>
> Source RPM : rdma-core-17.2-3.el7.src.rpm<br>
> Build Date : Wed 31 Oct 2018 07:10:24 AM AEDT<br>
> Build Host : <a href="http://x86-01.bsys.centos.org" target="_blank" rel="noreferrer">x86-01.bsys.centos.org</a><br>
> Relocations : (not relocatable)<br>
> Packager : CentOS BuildSystem <<a href="http://bugs.centos.org" target="_blank" rel="noreferrer">http://bugs.centos.org</a>><br>
> Vendor : CentOS<br>
> URL : <a href="https://github.com/linux-rdma/rdma-core" target="_blank" rel="noreferrer">https://github.com/linux-rdma/rdma-core</a><br>
> Summary : RDMA core userspace libraries and daemons<br>
> Description :<br>
> RDMA core userspace infrastructure and documentation, including initscripts,<br>
> kernel driver-specific modprobe override configs, IPoIB network scripts,<br>
> dracut rules, and the rdma-ndd utility.<br>
><br>
> -- <br>
> Chris Samuel : <a href="http://www.csamuel.org/" target="_blank" rel="noreferrer">http://www.csamuel.org/</a> : Berkeley, CA, USA<br>
> _______________________________________________<br>
> Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org" target="_blank">Beowulf@beowulf.org</a> sponsored by Penguin Computing<br>
> To change your subscription (digest mode or unsubscribe) visit <br>
> <a href="https://beowulf.org/cgi-bin/mailman/listinfo/beowulf" target="_blank" rel="noreferrer">https://beowulf.org/cgi-bin/mailman/listinfo/beowulf</a><br>
<br>
<br>
<br>
</blockquote></div>