[Beowulf] Anyone with really large clusters seeing memory leaks
with OFED 1.5 for TCP-based apps?
landman at scalableinformatics.com
Sat Jan 30 21:38:24 PST 2010
Trying to trace something annoying down, and see if we are running
into something that is known.
OFED 1.5 on a 18.104.22.168 kernel. Running a file system atop IPoIB
(many reasons, none I care to get into here at the moment). Under light
load, the file system gradually grabs memory. Possibly a leak, not
entirely sure. Could be the OFED stack underneath. Backing file system
is xfs. That has been (on this hardware in other situations) rock
solid stable. Here, xfs, OFED/IPoIB all toss their cookies (and fail
allocations) under moderate to heavy load.
Working with the file system vendor on this. I am not sure we have
the answer nailed, so I wanted to see who out there is running a big
(>512 nodes) cluster, doing large data transfers (preferably over
IPoIB), for data storage, and running a late model OFED. If you fall
into this category, please let me know, as I'd like to ask a few
questions offline about any observed OFED/IPoIB failure modes. I am not
convinced it is OFED/IPoIB, but I'd like to see what other people have
run into ... if anything.
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web : http://scalableinformatics.com
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615