bpsh and memory leak - wien

Donald Becker becker at scyld.com
Wed Oct 2 13:10:35 PDT 2002


On Wed, 2 Oct 2002, Florent Calvayrac wrote:

> Donald Becker wrote:
> > 
> > On Tue, 1 Oct 2002, Florent Calvayrac wrote:
> > 
> > > We are trying to use WIEN97 on our Scyld Beowulf cluster, and
> > > the following happens: the program lapw1 (more or less
> > > pure Fortran 77), run interactively on the front node,
> > > happily grows to, say, 30 MB and then runs to completion.
> > > When run with bpsh on a remote node, the available memory
> > > just shrinks until the system swaps itself to a stall.
...
> > What is using the memory?
> 
> I do not know !

You can run 'bpsh 0 ps aux' to see the memory usage.
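
To see at a glance which process holds the most resident memory, the
'ps aux' output can be piped through a small filter.  This is only a
sketch: the function name top_rss is made up for illustration, and it
assumes the usual procps column layout (PID in column 2, RSS in KiB in
column 6, start of the command in column 11).

```shell
# top_rss [N]: read 'ps aux' output on stdin and print the N processes
# with the largest resident set size as "RSS_KiB PID COMMAND".
# Hypothetical helper; assumes the standard 'ps aux' column order.
top_rss() {
    n=${1:-5}
    # Skip the header line, keep RSS, PID and the first word of the command,
    # then sort numerically on RSS, largest first.
    awk 'NR > 1 { print $6, $2, $11 }' | sort -k1,1nr | head -n "$n"
}

# Intended use on a Scyld front end (not run here):
#   bpsh 0 ps aux | top_rss 5
```

If no process in that list is growing while free memory shrinks, the
memory is probably going somewhere other than a process's own pages.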

> On the front node, we type
>   /home/wien/lapw1 lapw1.def
> hit enter, and it runs fine. With
>   bpsh 0 /home/wien/lapw1 lapw1.def
> a "bpsh 0 free -t" shows that available memory runs down
> to 0, then the red light of the hard disk comes on,

OK, my guess is still that some output is consuming memory on the
RAMdisk root.

You can see the ramdisk usage with
   bpsh 0 df
or with the beostatus tool.
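
To watch whether the root filesystem fills up while the job runs, the
'df' output can be reduced to a single number and sampled in a loop.
A minimal sketch follows; the function name root_use_pct is invented
for illustration, and it assumes the POSIX 'df -P' layout, where the
capacity is the next-to-last field and the mount point is the last.

```shell
# root_use_pct: read 'df -P' output on stdin and print the usage
# percentage (without the % sign) of the filesystem mounted on "/".
# Hypothetical helper; assumes one-line-per-filesystem 'df -P' output.
root_use_pct() {
    awk '$NF == "/" { pct = $(NF - 1); sub(/%/, "", pct); print pct }'
}

# Intended use on a Scyld front end (not run here), one sample every 10 s:
#   while sleep 10; do bpsh 0 df -P | root_use_pct; done
```

A number that climbs steadily during the run would support the guess
that output written to the RAMdisk is what consumes the memory.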

> (and "bpsh 0 cat /proc/meminfo" confirms that the system starts
> to swap). I do not understand a single thing in the output of slabinfo.

The slabinfo output requires expert interpretation, and even then the
expert must already have narrowed down the likely problems.

> front and remote nodes). However, the process takes about 2 minutes
> to transform itself from "init" into the actual "lapw1" we run,
> and sometimes fails with a "BProc move failed", maybe because
> NFS is hit hard in the process.

Two minutes?  That's an exceptionally long time.
Is this a Fortran program with a large pre-defined common area?
  (We had to make a change in our system to avoid transferring
  monstrously large zeroed-out Fortran common areas -- that may be the
  problem here.)
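
One rough way to check for a large zeroed area is the binutils 'size'
tool: uninitialized Fortran COMMON blocks normally end up in the bss
segment, so a very large bss figure is a hint.  The sketch below
assumes the default Berkeley output format of 'size' (header line, then
text, data, bss, ...); the function name bss_bytes is invented for
illustration.

```shell
# bss_bytes: read the output of 'size FILE' (Berkeley format) on stdin
# and print the size of the bss segment in bytes.
# Hypothetical helper; assumes a single file, so the data is on line 2.
bss_bytes() {
    awk 'NR == 2 { print $3 }'
}

# Intended use on the front end (not run here):
#   size /home/wien/lapw1 | bss_bytes
```

A result in the hundreds of megabytes would be consistent with a large
pre-defined common area; a small one would point elsewhere.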

> By the way, "top" does not run properly, indicating a size of "0"
> for remote processes when run on the front node,

Yes, that's right.  The current version of 'top' reports the memory
used on the front end.  An older version of the system accurately
reported the memory usage on the compute nodes, but the impossible
numbers confused programs.

> and is not willing to run on remote nodes 
> ("bpsh 0 top" fails because of an unknown TERM error - I
> guess termcap is not installed properly on the 
> remote nodes, I even tried with a TERM=vt100) 

That is not a termcap problem: there is no controlling 'tty'.

-- 
Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993



