> It's actually an MPI job (HPL using OpenMPI) which is reporting the
> problem.
> The head scratching continues...

It seems, from the ongoing discussion, that you do not have a hardware
problem but an Open MPI one. I have seen Open MPI fail because a user-level
(or kernel-level; in my case it was user-level) verbs or similar library
was missing.


1) Check whether the job runs with, say, -mca btl ^udapl (this excludes
the uDAPL BTL), or with an explicit BTL list, e.g.
-mca btl openib,tcp,sm,self


2) More tediously, check that every library present on a non-failing node
is also available on the failing one. There is a "Getting Started with
InfiniBand" page that lists the libraries/products you should have loaded
for a fully functioning IB stack. It solved my problem :-)
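
For step 2, a rough sketch of how the comparison could be automated. It
assumes Linux nodes where you have saved the output of "ldconfig -p" on a
working node and on the failing node; the file names good.txt and bad.txt
and the function names are my own placeholders, not part of any tool. It
only compares what the dynamic linker cache knows about, so it will not
catch missing kernel modules.

```python
import sys


def parse_ldconfig(text):
    """Return the set of library names listed in `ldconfig -p` output."""
    names = set()
    for line in text.splitlines():
        if "=>" not in line:
            continue  # skip the "N libs found in cache" header line
        names.add(line.split()[0])  # first token is the library name
    return names


def missing_libraries(good_text, bad_text):
    """Libraries present on the working node but absent on the failing one."""
    return sorted(parse_ldconfig(good_text) - parse_ldconfig(bad_text))


# Hypothetical usage: python libdiff.py good.txt bad.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as good, open(sys.argv[2]) as bad:
        for name in missing_libraries(good.read(), bad.read()):
            print(name)
```

Anything it prints with "verbs", "rdma" or "dapl" in the name is the first
thing I would look at.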



