[Beowulf] IB problem/using IB diagnostics
Gus Correa
gus at ldeo.columbia.edu
Fri Jun 19 09:14:52 PDT 2009
Prentice Bisbal wrote:
> John Hearns wrote:
>>
>> 2009/6/18 Prentice Bisbal <prentice at ias.edu>
>>
>> John Hearns wrote:
>> > Can you log into node36 and run ibstat or ibstatus?
>> >
>>
>> Looks good to me!
>> Links are up and it sees a subnet manager. As Greg says, looks like
>> something wonky in the script which is reporting
>> the node status??
>
> It's actually an MPI job (HPL using OpenMPI) which is reporting the
> problem.
>
> The head scratching continues...
>
Hi Prentice, list
Just in case you haven't seen this ...
Are you using OpenMPI 1.3.0 or 1.3.1?
Those versions have a memory leak bug when running over InfiniBand (IB).
The fix for the memory leak is to upgrade to 1.3.2.
A workaround is to set the MCA parameter mpi_leave_pinned to 0
on the mpirun command line: --mca mpi_leave_pinned 0
See:
http://www.open-mpi.org/community/lists/announce/2009/04/0030.php
https://svn.open-mpi.org/trac/ompi/ticket/1853
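If you launch your runs from a wrapper script, the workaround can be
applied only where it is needed. Here is a minimal sketch in Python,
not something from the Open MPI docs: the function names, the process
count, and the ./xhpl path are hypothetical, and it assumes the
affected versions are exactly 1.3.0 and 1.3.1, as stated above.

```python
import re


def parse_version(version):
    """Turn a string such as '1.3.1' or '1.3' into a (major, minor, patch) tuple."""
    match = re.match(r"^(\d+)\.(\d+)(?:\.(\d+))?", version.strip())
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    major, minor, patch = match.groups()
    # A missing patch level ('1.3') is treated as patch 0.
    return int(major), int(minor), int(patch or 0)


def needs_leave_pinned_workaround(version):
    """Assumption: only 1.3.0 and 1.3.1 need the workaround."""
    return parse_version(version) in {(1, 3, 0), (1, 3, 1)}


def build_mpirun_command(version, nprocs, program):
    """Build the mpirun argument list, adding the workaround when needed."""
    command = ["mpirun"]
    if needs_leave_pinned_workaround(version):
        # mpirun takes the MCA key and its value as two separate arguments.
        command += ["--mca", "mpi_leave_pinned", "0"]
    command += ["-np", str(nprocs), program]
    return command


if __name__ == "__main__":
    # Hypothetical example: 4 processes running an HPL binary named ./xhpl.
    print(" ".join(build_mpirun_command("1.3.1", 4, "./xhpl")))
```

The sketch only builds the command line; it does not run anything, so
you can check the result before handing it to your batch system.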
My HPL runs with OpenMPI 1.3.1 crashed when using lots of memory.
Upgrading to 1.3.2 fixed the problem,
but I never looked closely at the error messages,
so your problem may be different.
However, memory leaks can produce weird errors that are hard to diagnose.
My $0.02.
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------