Linux memory leak?

Fri Mar 1 12:21:22 PST 2002

Mark Hahn wrote:
> 
> > normal.  Still, if 'free' cannot be always trusted, then system
> > management decisions based on free memory can be flaky.
> 
> free=wasted, to Linux.  if you're worried about MM sanity,
> you should look at swap *ins*...

I'm primarily interested in RAM usage minus buffers minus cache, which
our batch scheduler uses to avoid paging.  The 'free' problem can happen
on our 384MB, 512MB, 1GB and 2GB machines, and it resembles the Red Hat
bug report http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=59002,
even though that report talks about 2GB+ systems.  Gerry Morong
describes the problem this bug causes for the LSF batch scheduler:

> I am experiencing a similar problem on Red Hat 7.2 with the 2.4.9-* kernels.  
> If I run jobs past core memory into swap, significant memory and swap are still 
> allocated when the jobs finish.  Have tested many configuration (most 2 
> processor) 1GB, 2GB, 3GB, 4GB RAM.  All have the same problem.  Example:  2P, 
> 4GB RAM, 8GB swap.  Run 2 jobs both asking for 2.5GB ( total 5GB ).  Memory and 
> swap both push 4GB each.  When the jobs finish, both memory and swap are still 
> holding 2.5GB of space each.  Eventually our compute farm managed by LSF will 
> not allocate jobs to each machine because free memory is almost non-existent.

The key point above is the comment about LSF: bad system data leads to
bad scheduler behavior.
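
For concreteness, here is a minimal sketch of the "RAM usage minus
buffers minus cache" figure mentioned above.  It assumes the numbers
come from the labeled lines of /proc/meminfo (MemTotal, MemFree,
Buffers, Cached, all in kB); the function names are made up for this
example, and this is not necessarily how LSF or any other scheduler
obtains its data.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return the value (in kB) of a "Key:   12345 kB" line in meminfo-style
 * text, or -1 if no line starts with that key.  (Hypothetical helper.) */
long meminfo_kb(const char *text, const char *key)
{
        size_t klen = strlen(key);
        const char *p = text;

        while (p && *p) {
                if (strncmp(p, key, klen) == 0 && p[klen] == ':')
                        return strtol(p + klen + 1, NULL, 10);
                p = strchr(p, '\n');    /* advance to the next line */
                if (p) p++;
        }
        return -1;
}

/* RAM used by applications: total minus free minus buffers minus cache.
 * Returns -1 if any of the four fields is missing. */
long app_used_kb(const char *text)
{
        long total   = meminfo_kb(text, "MemTotal");
        long memfree = meminfo_kb(text, "MemFree");
        long buffers = meminfo_kb(text, "Buffers");
        long cached  = meminfo_kb(text, "Cached");

        if (total < 0 || memfree < 0 || buffers < 0 || cached < 0)
                return -1;
        return total - memfree - buffers - cached;
}

/* Read /proc/meminfo (assumed to fit in 8 KB) and apply the formula. */
long app_used_kb_now(void)
{
        static char buf[8192];
        FILE *f = fopen("/proc/meminfo", "r");
        size_t n;

        if (f == NULL) return -1;
        n = fread(buf, 1, sizeof buf - 1, f);
        fclose(f);
        buf[n] = '\0';
        return app_used_kb(buf);
}
```

Calling app_used_kb_now() from a small main() and comparing the result
before and after a job runs is one way to see whether memory stays
"missing" once the job has exited.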

BTW, the "malloc() all memory then quit" procedure does not always fix
the numbers reported by 'free'.  On our largest node (2GB RAM plus 4GB
swap) running the stock (Red Hat) kernel 2.4.9-21, the maximum space
malloc() can get is 1919 MB, not even close to the 3 GB process address
space limit even though 'ulimit' is unlimited.  After this 1919 MB is
reached, my test program quits (thereby releasing memory), but 'free'
numbers remain unreasonable.
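
To double-check from inside a program that no resource limit is capping
malloc(), something like the sketch below could be used.  It relies on
the standard getrlimit() call; the helper names and the choice of
RLIMIT_AS and RLIMIT_DATA as the limits worth checking are my
assumptions.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Return the soft limit for `resource` in MB, -1 if it is unlimited,
 * or -2 if getrlimit() fails.  (Hypothetical helper.) */
long soft_limit_mb(int resource)
{
        struct rlimit rl;

        if (getrlimit(resource, &rl) != 0) return -2;
        if (rl.rlim_cur == RLIM_INFINITY) return -1;
        return (long)(rl.rlim_cur >> 20);
}

/* Print the two limits most likely to cap malloc(). */
void print_malloc_limits(void)
{
        printf("RLIMIT_AS   = %ld MB\n", soft_limit_mb(RLIMIT_AS));
        printf("RLIMIT_DATA = %ld MB\n", soft_limit_mb(RLIMIT_DATA));
}
```

If both values print as -1, the limit seen by malloc() is coming from
somewhere other than these two rlimits.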

Finally, our machines do not enter this "missing memory" state at
random.  It seems that some users' MPI-based parallel code(s) can force
the machine into that state, while other codes run fine.  This suggests
that Linux kernel 2.4.9 allows a mere application to royally mess up its
'free' numbers.  BTW, Red Hat just tweaked their stable kernels to
2.4.9-31, but not yet 2.4.17.

Sincerely,
Josip

P.S.  Here is a simple program to figure out how many MB malloc() can
grant.  BTW, malloc() error detection (via errno!=0 or via
malloc()==NULL or via environment variable MALLOC_CHECK_=1) in Linux is
not very reliable (the program often gets terminated before printing the
final result).  This is why it helps to print the number of MB allocated
after each successful malloc().

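#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>     /* sleep() */

#define MAXMB (4<<10)   /* stop after 4096 MB */
#define MB (1<<20)      /* bytes per allocation */
#define PG (4<<10)      /* assumed page size: 4 KB */

int main(void)
{
        char *m;
        int i,j;

        printf("PG = %d\n",PG);
        printf("MB = %d\n",MB);
        printf("MAXMB = %d\n\n",MAXMB);
        fflush(stdout);
        sleep(3);
        for(i=0;i<MAXMB;i++) {
                m=malloc(MB);
                printf("%d MB ...",i+1);
                fflush(stdout); /* keep progress visible if we get killed */
                /* Test only the return value: errno may be nonzero */
                /* even after a successful library call.            */
                if(m==NULL) break;
                /* Touch every page so the memory is really used. */
                for(j=0;j<MB;j+=PG) m[j]='A';
                printf(" OK\n");
                fflush(stdout);
        }
        printf("\n\nAllocated %d MB\n",i);
        return 0;
}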

-- 
Dr. Josip Loncaric, Research Fellow               mailto:josip at icase.edu
ICASE, Mail Stop 132C           PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center             mailto:j.loncaric at larc.nasa.gov
Hampton, VA 23681-2199, USA    Tel. +1 757 864-2192  Fax +1 757 864-6134