<div dir="ltr">Hi Chris,<div>We are running ConnectX-4 (CX4) cards and have had some issues as well. Which version(s) of Open MPI are they running?</div><div><br></div><div>If you follow the instructions from Mellanox and run with yalla and MXM, including setting the appropriate environment variables or config file, that works(ish) with Open MPI 1.10.3.</div><div><br></div><div>If they are running the Open MPI 2.1 series, there are some issues with compiling in the Mellanox drivers.</div><div><br></div><div>We haven't seen any hard locks like this, but we have seen a whole bundle of other issues.</div></div><div class="gmail_extra"><br clear="all"><div><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr">Cheers,<br><br>Lance<br>--<br>Dr Lance Wilson<br>Characterisation Virtual Laboratory (CVL) Coordinator &amp;</div><div dir="ltr">Senior HPC Consultant</div><div>Ph: 03 99055942 (+61 3 99055942)</div><div dir="ltr">Mobile: 0437414123 (+61 4 3741 4123)</div><div dir="ltr">Multi-modal Australian ScienceS Imaging and Visualisation Environment<br>(<a href="http://www.massive.org.au/" rel="noreferrer" style="color:rgb(17,85,204)" target="_blank">www.massive.org.au</a>)<br>Monash University<br></div></div></div></div></div></div></div></div></div>
<br><div class="gmail_quote">On 26 October 2017 at 22:42, Chris Samuel <span dir="ltr">&lt;<a href="mailto:samuel@unimelb.edu.au" target="_blank">samuel@unimelb.edu.au</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi folks,<br>
<br>
I'm helping another group out and we've found that running an Open MPI<br>
program, even just a singleton, will kill nodes with Mellanox ConnectX-4 and<br>
ConnectX-5 cards using RoCE (the mlx5 driver). The node just locks up hard<br>
with no oops or other diagnostics and has to be power cycled.<br>
<br>
Disabling openib/verbs support with:<br>
<br>
export OMPI_MCA_btl=tcp,self,vader<br>
<br>
stops the crashes and, whilst it's hard to tell, strace seems to imply it hangs<br>
when trying to probe for openib/verbs devices (or shortly after).<br>
<br>
Nodes with ConnectX-3 cards (mlx4 driver) don't seem to have the issue and I'm<br>
reasonably convinced this has to be a driver bug, or perhaps a bad interaction<br>
with recent 4.11.x and 4.12.x kernels (they need those for CephFS).<br>
<br>
They've got a bug open with Mellanox already but I was wondering if anyone<br>
else had seen anything similar?<br>
<br>
cheers!<br>
<span class="HOEnZb"><font color="#888888">Chris<br>
--<br>
Christopher Samuel Senior Systems Administrator<br>
Melbourne Bioinformatics - The University of Melbourne<br>
Email: <a href="mailto:samuel@unimelb.edu.au">samuel@unimelb.edu.au</a> Phone: <a href="tel:%2B61%20%280%293%20903%2055545" value="+61390355545">+61 (0)3 903 55545</a><br>
<br>
______________________________<wbr>_________________<br>
Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org">Beowulf@beowulf.org</a> sponsored by Penguin Computing<br>
</font></span></blockquote></div><br></div>