bpsh and memory leak - wien
Florent Calvayrac
Florent.Calvayrac at univ-lemans.fr
Wed Oct 2 08:24:07 PDT 2002
Donald Becker wrote:
>
> On Tue, 1 Oct 2002, Florent Calvayrac wrote:
>
> > We try to use WIEN97 on our Scyld beowulf cluster, and
> > the following happens : the program lapw1 (more or less
> > pure fortran 77), run interactively on the front node,
> > happily grows to, say, 30 MB and then runs until completion.
> > When run with bpsh on a remote node, the available memory
> > just shrinks down until the system swaps to stall.
>
> You can use 'top' or 'ps' on the master to monitor memory usage of the
> process
Thanks a lot 8-)!
> What is using the memory?
>
I do not know!
To summarize: on the front node, we type
/home/wien/lapw1 lapw1.def
hit enter, and it just runs fine. On a remote node, we run
bpsh 0 /home/wien/lapw1 lapw1.def
and "bpsh 0 free -t" shows that available memory runs down
to 0; then the red hard-disk light comes on
(and "bpsh 0 cat /proc/meminfo" confirms that the system starts
to swap). I do not understand a single thing in the output of slabinfo.
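
For anyone trying to read the same figures: Linux also uses otherwise idle
memory for buffers and file cache, so a shrinking "free" number does not by
itself say which kind of memory is growing. Below is a minimal Python sketch
(not part of our setup) that splits a /proc/meminfo dump into free memory,
buffers plus cache, and swap in use. The function names are made up for
illustration, calling bpsh through subprocess is an assumption, and only
lines of the form "Name: value kB" are understood.

```python
import subprocess


def parse_meminfo(text):
    """Return {field: value_in_kB} for the 'Name: value kB' lines of meminfo."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Skip headers, blank lines, and lines not expressed in kB.
        if sep and len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def summarize(fields):
    """Separate free memory from buffers/cache and compute swap in use (kB)."""
    return {
        "free_kB": fields.get("MemFree", 0),
        "buffers_cache_kB": fields.get("Buffers", 0) + fields.get("Cached", 0),
        "swap_used_kB": fields.get("SwapTotal", 0) - fields.get("SwapFree", 0),
    }


def read_node_meminfo(node=0):
    """Assumption: bpsh is on PATH and the node exposes /proc/meminfo."""
    result = subprocess.run(
        ["bpsh", str(node), "cat", "/proc/meminfo"],
        capture_output=True, text=True, check=True,
    )
    return parse_meminfo(result.stdout)
```

Calling summarize(read_node_meminfo(0)) every few seconds while lapw1 runs
would show whether buffers/cache or swap is the number that moves.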
"ps -efl" gives the same result on the front node
as on the remote nodes (with "bpsh 0 ps -efl"): in our case "186888", amounting
to something like 30 Mbytes, knowing that we have 512 Mbytes on both the
front and remote nodes. However, the process takes about 2 minutes
to transform itself from "init" into the actual "lapw1" we run,
and it sometimes fails with a "BProc move failed", maybe because
NFS is hit hard in the process.
We had a look at the open files with "lsof", and they are the same
on the front node and the remote nodes. There do not seem to
be any error files open on the remote node. However, I wrote
a small C program to fill the ramdisk... and when the ramdisk was full,
nothing peculiar happened: the node did not start to swap!
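
The ramdisk test can be sketched roughly as follows. This is a Python
stand-in written for illustration, not the C program mentioned above: it
writes zero-filled chunks to a scratch file until the filesystem reports
"no space left" or an optional byte limit is reached. The function name,
the scratch file name "fill.tmp" and the max_bytes safety limit are all
assumptions.

```python
import errno
import os


def fill_directory(path, chunk_size=1 << 20, max_bytes=None):
    """Write zeros to a scratch file under `path` until the filesystem is
    full (ENOSPC) or `max_bytes` have been written.

    Returns (bytes_written, filled, scratch_file_path).
    """
    target = os.path.join(path, "fill.tmp")  # hypothetical scratch file name
    chunk = b"\0" * chunk_size
    written = 0
    filled = False
    # Unbuffered, so each write goes straight to the filesystem.
    with open(target, "wb", buffering=0) as f:
        while max_bytes is None or written < max_bytes:
            size = chunk_size
            if max_bytes is not None:
                size = min(chunk_size, max_bytes - written)
            try:
                written += f.write(chunk[:size])
            except OSError as exc:
                if exc.errno == errno.ENOSPC:
                    filled = True  # filesystem is full: stop here
                    break
                raise
    return written, filled, target
```

Run against the ramdisk mount point without max_bytes, it stops once the
ramdisk is full; the scratch file has to be removed by hand afterwards.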
By the way, "top" does not run properly: it indicates a size of "0"
for remote processes when run on the front node, and it will not run on remote nodes
("bpsh 0 top" fails with an unknown TERM error; I
guess termcap is not installed properly on the
remote nodes, and I even tried with TERM=vt100).
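
Since "top" needs a working terminal, one non-interactive alternative would
be to read the per-process memory lines from /proc/<pid>/status on the node.
A minimal Python sketch follows, under the assumptions that the process is
visible under /proc/<pid> on the node with the PID passed in (which may not
hold under BProc) and that bpsh is on PATH. The function names are
illustrative only.

```python
import subprocess


def parse_status(text):
    """Extract VmSize and VmRSS (in kB) from the text of /proc/<pid>/status."""
    sizes = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if (sep and name in ("VmSize", "VmRSS")
                and len(parts) == 2 and parts[1] == "kB"):
            sizes[name] = int(parts[0])
    return sizes


def read_node_status(node, pid):
    """Assumption: /proc/<pid>/status is readable on the node through bpsh."""
    result = subprocess.run(
        ["bpsh", str(node), "cat", "/proc/%d/status" % pid],
        capture_output=True, text=True, check=True,
    )
    return parse_status(result.stdout)
```

Comparing VmSize and VmRSS over time with the node-wide free memory would
indicate whether the process itself is what grows.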
Thanks for the help anyway
If anyone has any ideas...
--
Florent Calvayrac |
Laboratoire de Physique de l'Etat Condense |
UMR-CNRS 6087 | http://www.univ-lemans.fr/~fcalvay
Universite du Maine-Faculte des Sciences |
72085 Le Mans Cedex 9