[Beowulf] Anyone with really large clusters seeing memory leaks with OFED 1.5 for tcp based apps?

Joe Landman landman at scalableinformatics.com
Sat Jan 30 21:38:24 PST 2010

Hi folks

   Trying to track down something annoying, and to see whether we are 
running into a known issue.

   OFED 1.5 on a kernel.  Running a file system atop IPoIB 
(many reasons, none I care to get into here at the moment).  Under light 
load, the file system gradually grabs memory.  Possibly a leak, not 
entirely sure.  Could be the OFED stack underneath.  Backing file system 
is xfs, which has been (on this hardware in other situations) rock 
solid stable.  Here, xfs and OFED/IPoIB all toss their cookies (and fail 
allocations) under moderate to heavy load.
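
   For anyone wanting to compare notes on the same symptom: one rough 
way to see whether the growth is on the kernel side (slab) rather than 
in page cache is to sample /proc/meminfo at intervals and look at the 
deltas.  A minimal sketch, assuming a Linux /proc/meminfo; the function 
names, the choice of fields, and the 60 second interval are illustrative 
assumptions, not something taken from the setup described above.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: int}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Name: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def growth(before, after, fields=("MemFree", "Slab", "SUnreclaim")):
    """Return after - before for each field present in both snapshots."""
    return {f: after[f] - before[f] for f in fields if f in before and f in after}


def sample(path="/proc/meminfo"):
    """Read and parse one snapshot from the given path."""
    with open(path) as fh:
        return parse_meminfo(fh.read())


if __name__ == "__main__":
    interval = 60  # seconds between samples; an arbitrary choice
    prev = sample()
    while True:
        time.sleep(interval)
        cur = sample()
        print(time.strftime("%H:%M:%S"), growth(prev, cur))
        prev = cur
```

   A steadily rising SUnreclaim under light load would point toward 
kernel allocations that are not being returned, though it does not by 
itself say which layer is holding them.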

   Working with the file system vendor on this.  I am not sure we have 
the answer nailed, so I wanted to see who out there is running a big 
(>512 nodes) cluster, doing large data transfers (preferably over 
IPoIB), for data storage, and running a late model OFED.  If you fall 
into this category, please let me know, as I'd like to ask a few 
questions offline about any observed OFED/IPoIB failure modes.  I am not 
convinced it is OFED/IPoIB, but I'd like to see what other people have 
run into ... if anything.



Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

