memory leak
Brian LaMere
blamere at diversa.com
Wed Dec 18 12:29:59 PST 2002
I've been running the same version of Scyld's distribution since September.
I know there has to have been a change somewhere, but that's what I'm having
such a hard time tracking down. The scripts handling the queuing system are
altered on a regular basis, but I can't for the life of me figure out why a
perl script would alter file caching - I, like everyone else <grin>, am aware
that the memory isn't really being used. It's purely a buffer thing - has to
be. That, or malicious code that is confusing the memory manager, or a leak
(the only reason I suspect a leak is its gradual growth).
I have updated absolutely nothing on the cluster since Sept24, nor have I
made a single parameter change. Nothing is different other than the files
on the nfs server, and possibly some settings on the nfs server itself
(though I can't figure out how any server settings would cause gradual
deterioration over time, rather than problems showing up right away).
If I had changed anything, it would be really easy for me to point at and
figure out, and I wouldn't be frustrated with it. I have changelogs that
I've reviewed. On the cluster, there have been no updates, no installs, no
changes.
Whenever I install something, it is by one of two ways. Either I install
from source, or I install from rpm. The last rpm I installed, according to
the rpm database, was on September 10. The newest makefile in either of the
src directories is from September 9. According to "ls -alrtR|grep Nov" and
"ls -altrR|grep Dec" (and again for Oct) run from inside the /etc directory,
nothing has changed (sans the things that get changed at reboot, like mtab)
since Sept 24.
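A broader sweep than the per-month greps would be something along the lines
of the find check Harvey suggests below (just a sketch; adjust the path and
the day count as needed):

[root at wiglaf etc]# find /etc -xdev -type f -mtime -90 | xargs -r ls -ltrd
(lists everything under /etc modified in roughly the last 90 days, newest last)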
It really just doesn't make sense to me that a perl script, or even updating
a perl binary on an nfs server (and then using it on the cluster), could be
causing this problem. The queue system has run nearly flawlessly for a year
and a half with this equipment. If that's possible, though, how would I
determine why a simple perl script is doing it?
The queue system pulls jobs from an outside mysql server...it was updated on
the 6th (after this problem started).
I'm just looking for suggestions as to what to do other than reboot the
systems once a week. If perl can, in fact, cause these types of problems
(how?) then hey... I can work with that. But it truly seems to be a buffer
problem. I can throw 500M into RAM without it complaining, and without swap
ever being hit. It just won't cache these files anymore, until I reboot the
systems.
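For reference, a quick way to double-check that it's reclaimable cache rather
than real usage would be to compare the raw and buffers/cache-adjusted numbers
(just a sketch; node 40 used as the example target):

[root at wiglaf etc]# bpsh 40 free -m
(the "-/+ buffers/cache" line shows usage with buffers and cache subtracted out)
[root at wiglaf etc]# bpsh 40 cat /proc/meminfo
(raw Buffers:, Cached:, and SwapFree: counters straight from the kernel)

If the adjusted "used" figure stays small while the cache column grows, that
memory should still be reclaimable on demand.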
Does that make sense?
Vmstat output below. Wiglaf is the master, and it was last rebooted on the
6th. Node 40 and 42 were both rebooted yesterday. Node 42 was taken out of
the queue, and is just sitting there 100% idle (control-ish box). Node 40
is currently idle, but is part of the queue. There are no jobs (right now)
in the queue. Despite the cache already showing high on 40, its performance
is still good... for now. In a week, I'll need to reboot it, along with
everyone else. If *I* knew the answer, I wouldn't be asking the question :)
I apologize if I seem frustrated - I am.
[root at wiglaf etc]# vmstat
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd    free    buff   cache  si  so    bi    bo   in    cs  us  sy  id
 0  0  0      0  544580  127004 2533160   0   0     0     2   10     2   1   1  19
[root at wiglaf etc]# bpsh 42 vmstat
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd    free    buff   cache  si  so    bi    bo   in    cs  us  sy  id
 0  0  0      0 2044748     884    5180   0   0     0     0   52     2   0   0 100
[root at wiglaf etc]# bpsh 40 vmstat
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd    free    buff   cache  si  so    bi    bo   in    cs  us  sy  id
 0  0  0      0   19020     884 2024524   0   0     0     0  111    73  29   3  68
[root at wiglaf etc]#
-----Original Message-----
From: Harvey J. Stein [mailto:HJSTEIN at bloomberg.com]
Sent: Wed 12/18/2002 9:13 AM
To: Brian LaMere
Cc: beowulf at beowulf.org
Subject: Re: memory leak
"Brian LaMere" <blamere at diversa.com> writes:
> It was not until a month ago that performance started becoming an issue,
> however. And it was not until yesterday that the cluster almost crippled
> the NFS server.
Given that this started a month ago, I'm going to ask an obvious
question, which presumably you've already checked, but you didn't
mention it in your messages. Has any software or hardware on the
machines changed in this period? This includes kernels, libs, apps,
configs, network cards, etc. on the NFS server, the cluster machines,
the routers/hubs, DNS, etc. Has the mix of jobs or the jobs
themselves running on the cluster changed? I'd do "find / -mtime -90"
& especially check NFS configs.
--
Harvey Stein
Bloomberg LP
hjstein at bloomberg.com