[Beowulf] Killing nodes with Open-MPI?

Ryan Novosielski novosirj at rutgers.edu
Thu Oct 26 09:30:29 PDT 2017


Where does this driver come from: the OS, OFED, or somewhere else?

We primarily use MVAPICH2, but I would be curious to try to reproduce this on our mlx5 equipment.

Which card models do you have?

--
____
|| \\UTGERS,       |---------------------------*O*---------------------------
||_// the State     |         Ryan Novosielski - novosirj at rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ     | Office of Advanced Research Computing - MSB C630, Newark
    `'

On Oct 26, 2017, at 07:43, Chris Samuel <samuel at unimelb.edu.au> wrote:

Hi folks,

I'm helping another group out and we've found that running an Open-MPI
program, even just a singleton, will kill nodes with Mellanox ConnectX-4 and
ConnectX-5 cards using RoCE (the mlx5 driver). The node just locks up hard
with no OOPS or other diagnostics and has to be power cycled.

Disabling openib/verbs support with:

export OMPI_MCA_btl=tcp,self,vader

stops the crashes. It's hard to tell, but strace seems to imply that it hangs
when trying to probe for openib/verbs devices (or shortly after).
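
For reference, a minimal sketch of the workaround described above, assuming a
Bourne-style shell. The program name ./my_mpi_app is a hypothetical
placeholder, and the mpirun lines are shown only as comments, since they need
a working Open MPI installation:

```shell
# Restrict Open MPI to the tcp, self (loopback) and vader (shared-memory)
# BTLs, so the openib/verbs BTL is never selected.
export OMPI_MCA_btl=tcp,self,vader

# The same setting can be passed per run instead of via the environment
# (./my_mpi_app is a hypothetical program name):
#   mpirun --mca btl tcp,self,vader ./my_mpi_app
# or, to exclude only openib and let Open MPI choose among the rest:
#   mpirun --mca btl ^openib ./my_mpi_app

# Show the value that child MPI processes will inherit.
echo "OMPI_MCA_btl=${OMPI_MCA_btl}"
```

The environment-variable form affects every MPI job started from that shell,
whereas the --mca form applies to a single run, which may be handier when
testing whether the openib probe is what triggers the lockup.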

Nodes with ConnectX-3 cards (mlx4 driver) don't seem to have the issue, and I'm
reasonably convinced this has to be a driver bug, or perhaps a bad interaction
with recent 4.11.x and 4.12.x kernels (they need those for CephFS).

They've got a bug open with Mellanox already but I was wondering if anyone
else had seen anything similar?

cheers!
Chris
--
Christopher Samuel        Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf