blamere at diversa.com
Wed Dec 18 07:18:21 PST 2002
Yes, I am aware.
It was not until a month ago that performance started becoming an issue,
however. And it was not until yesterday that the cluster almost crippled
the NFS server.
The file in particular they were hitting when this occurred has been the
same since Sep 24.
I am fully aware that the memory is still available. The problem is that
the buffers are not - and as such, it grabs the file *each and every time*,
as I said. If I reboot them, they do not grab the file each and every time.
I would love it if the buffers would get released, but they're not. I
thought I said this before, however? Jobs get completed about 2 a minute
when the memory is listed as "full." They get completed about 200 a minute
when the memory isn't. I'm really not sure how to paint a clearer picture
than that. What /should/ occur in theory is not, in fact, occurring. A
node can sit there unused for as much as 24 hours, and still exhibit the
same behavior.
When memory is not listed as full, the nodes slam the NFS server for a
moment, just long enough to grab whatever flatfile database the current tool
is running against. Then there is almost no network traffic at all for
hours, especially to the NFS server. This behavior changes after about a week.
This behavior never changed before.
I'm repeating myself simply because I obviously wasn't clear before. Yes, I
know buffers are held until something else is needed to be buffered, based
on a retention policy. The only insult to my intelligence is in my lack of
clarity in the description of what is going on.
From: John Hearns [mailto:John.Hearns at cern.ch]
Sent: Wednesday, December 18, 2002 2:45 AM
To: Brian LaMere
Cc: beowulf at beowulf.org
Subject: Re: memory leak
Brian, please forgive me if I am insulting your intelligence.
But are you sure that you are not just noticing the disk
buffering behaviour of Linux?
The Linux kernel will use up spare memory as disk buffers -
leading to an (apparent) lack of free memory.
This is not really the case - as the memory will be released
again when needed.
(Ahem. Was caught out by this too the first time I saw it...)
Anyway, if this isn't the problem, maybe you could send
us some of the stats from your system?
Maybe use nfsstat?
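
For the memory side of those stats, a small script along these lines might
help separate memory that is truly free from memory held as buffers and page
cache. This is only a sketch: it assumes /proc/meminfo exposes the usual
"MemFree", "Buffers" and "Cached" fields in kB, and the function names
(parse_meminfo, cache_summary, read_meminfo) are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer.

    Lines that are not of the form "Name: <number> ..." are skipped, so
    header lines without a leading number are simply ignored.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def cache_summary(fields):
    """Summarise free memory versus buffer/page-cache memory (in kB).

    Assumption: MemFree, Buffers and Cached are present; a missing field
    is treated as 0 rather than raising an error.
    """
    free = fields.get("MemFree", 0)
    buffers = fields.get("Buffers", 0)
    cached = fields.get("Cached", 0)
    return {
        "free_kb": free,
        "buffers_kb": buffers,
        "cached_kb": cached,
        "free_plus_cache_kb": free + buffers + cached,
    }


def read_meminfo(path="/proc/meminfo"):
    """Read and summarise the live meminfo file on a Linux node."""
    with open(path) as handle:
        return cache_summary(parse_meminfo(handle.read()))
```

Calling read_meminfo() on a node when jobs are fast and again when they are
slow, and comparing the two summaries, would show whether the "full" memory
is actually sitting in buffers and cache or is held somewhere else.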