<div dir="ltr"><div>Warming to my subject now. I really dont want to be specific about any vendor, or cluster management package.</div><div>As I say I have had experience ranging from national contracts, currently at a company with tens of thousands of cpus worldwide,</div><div>down to installing half rack HPC clusters for customers, and informally supporting half rack sized clusters where the users did not have formal support.</div><div><br></div><div>When systems are bought the shiny bit is the hardware - much is made of the latest generation CPUs, GPUS etc.</div><div>Buyers try to get as much hardare as they can for the price - usually ending up as CPU core count or HPL performance.</div><div>They will swallow support contracts as they dont want to have a big failure and have their management (Academic or industrial)</div><div>asking what the heck just happened and why the heck you are running without support.</div><div>The hardware support is provided by the vendors, and their regional distributors. </div><div>So from the point of view of a systems vendor hardware support is the responsibility of the distributor or hardware vendor.</div><div><br></div><div>What DOES get squeezed is the HPC software stack support and the applications level support.</div><div>After all - how hard can it be? </div><div>The sales guys told me that Intel now has 256 core processors with built in AI which will run any software faster</div><div>then you can type 'run'.</div><div>The new guy with the beard has a laptop which uses this Ubuntu operating system - and its all free.</div><div>Why do we need to pay $$$ for this cluster OS?</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div class="gmail_attr" dir="ltr">On Thu, 2 May 2019 at 17:18, John Hearns <<a href="mailto:hearnsj@googlemail.com">hearnsj@googlemail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid"><div dir="ltr"><div>Chris, I have to say this. I have worked for smaller companies, and have worked for cluster integrators.</div><div>For big University sized and national labs the procurement exercise will end up with a well defined support arrangement.</div><div><br></div><div>I have seen, in once company I worked at, an HPC system arrive which I was not responsible for.</div><div>This system was purchased by the IT department, and was intended to run Finite Element software.</div><div>The hardware came from a Tier 1 vendor, but it was integrated by a small systems integrator.</div><div>Yes, they installed a software stack and demonstrated that it would run Abaqus.</div><div>But beyond that there was no support for getting other applications running. And no training that I could see in diagnosing faults.</div><div><br></div><div>I am not going to name names, but I suspect experiences like that are common.</div><div>Companies want to procure kit for as little as possible. 
Tier 1 vendors and white box vendors want to make the sales.</div><div>But no-one wants to pay for Bright Cluster Manager, for example.</div><div>So the end user gets at best a freeware solution like Rocks, or at worst some Kickstarted setup which installs an OS,</div><div>the CentOS supplied IB drivers and MPI, and Gridengine slapped on top of that.</div><div><br></div><div>This leads to an unsatisfying experience on the part of the end users, and also for the engineers of the integrating company.</div><div><br></div><div>Which leads me to say that we see the rise of HPC in the cloud services- AWS, OnScale, Rescale, Verne Global etc. etc.</div><div>And no wonder - you should be getting a much more polished and ready to go infrastructure, even though you cant physically touch it.</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div class="gmail_attr" dir="ltr">On Thu, 2 May 2019 at 17:08, Christopher Samuel <<a href="mailto:chris@csamuel.org" target="_blank">chris@csamuel.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid">On 5/2/19 8:40 AM, Faraz Hussain wrote:<br>
<br>
> So should I be paying Mellanox to help? Or is it a RedHat issue? Or is <br>
> it our hardware vendor, HP, who should be involved?<br>
<br>
I suspect that would be set out in the contract for the HP system.<br>
<br>
The clusters I've been involved in purchasing in the past have always <br>
required support requests to go via the immediate vendor and they then <br>
arrange to put you in contact with others where required.<br>
<br>
All the best,<br>
Chris<br>
-- <br>
Chris Samuel : <a href="http://www.csamuel.org/" target="_blank" rel="noreferrer">http://www.csamuel.org/</a> : Berkeley, CA, USA<br>
_______________________________________________<br>
Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org" target="_blank">Beowulf@beowulf.org</a> sponsored by Penguin Computing<br>
To change your subscription (digest mode or unsubscribe) visit <a href="https://beowulf.org/cgi-bin/mailman/listinfo/beowulf" target="_blank" rel="noreferrer">https://beowulf.org/cgi-bin/mailman/listinfo/beowulf</a><br>
</blockquote></div>
</blockquote></div>