memory leak
Robert G. Brown rgb at phy.duke.edu
Wed Dec 18 09:39:44 PST 2002
On Wed, 18 Dec 2002, Brian LaMere wrote:

> When memory is not listed as full, the nodes slam the NFS server for a
> moment, just long enough to grab whatever flatfile database the current tool
> is running against. Then there is almost no network traffic at all for
> hours, esp to the NFS server. This behavior changes after about a week.
> This behavior never changed before.
>
> I'm repeating myself simply because I obviously wasn't clear before. Yes, I
> know buffers are held until something else is needed to be buffered, based
> on a retention policy. The only insult to my intelligence is in my lack of
> clarity in the description of what is going on.

Can you rearrange your program(s) to not use NFS (which sucks in oh, so many ways)? E.g. rsync the DB to the nodes?

Have you tried tuning the nfsd, e.g. increasing the number of threads?

Have you tried tuning the NFS mounts themselves (rsize, wsize)?

Have you considered that the problem could be with file locking -- if you have lots of nodes trying to open and read, but especially to write, to the same file, there could be all sorts of queues and problems being created with file locking (rpc.lockd). Have you tried to resolve this by (perhaps) maintaining several copies of the files in contention and spreading the open/close load around?

Have you considered the problems associated with plain old latency -- e.g., suppose that application a on node A opens file X on the server, reads a bunch of stuff from it, and then writes a bit onto the end, and closes it. In the meantime, application b on node B is trying to open it. I >>think<< that NFS is required to flush the modified image through to disk before it can reissue the image to another request (part of its being a "reliable" protocol, so that application b doesn't see the "wrong image" of the file).
This can take anything from hundredths of seconds to seconds, depending on file size and server load, so you might not see any problem at all as long as demand is lower than some threshold and then "suddenly" start seeing it as you start to encounter "collision storms". This used to happen a lot on shared 10 Mbps ethernet, especially thinwire when the lengths were borderline too long and too heavily laden with hosts (so the probability of collisions was relatively high) -- an entire network could be nonlinearly brought to its knees by a single host inching the total network traffic up over a critical level, causing error recoveries and retransmissions to start to pile up with positive feedback (re: "packet storm").

Of course nobody can tell you which of these problems is the critical one in your particular situation, but maybe the list of the above will help you debug it. The key thing to do is to try to learn about the particular subsystem(s) associated with the delays.

Sure, maybe it's just "a kernel bug" (and the kernel list may be the right place to seek help:-). OTOH, it could very easily be something that is your "fault" in that you have pushed your network out of the regime where stable operation can ever be realistically expected for your particular task architecture. In that case, you'll both have to debug it yourself (figure out what is failing) and figure out how to re-architect it so that it no longer is a problem. Not easy, actually -- takes a lot of trial and effort and can even end up being something REALLY trivial like a bad network cable or bad switch port so that errors you thought were "broken kernel" or even "broken software" were really "bad hardware" and impossible to EVER fix without replacing the bad hardware.

nfsstat, vmstat, cat /proc/stat, plain old stat, netstat, and perhaps tools like wulfstat/xmlsysd (available at www.phy.duke.edu/brahma/xmlsysd.html) are your friends. Try clever experiments.
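
As one hedged example of such an experiment: the sketch below compares two snapshots of the client-side RPC counters and reports what fraction of the calls made in between were retransmissions. It assumes a Linux NFS client where /proc/net/rpc/nfs contains a line of the form "rpc <calls> <retrans> <authrefrsh>" (the same counters nfsstat reports for client RPC). The function names are made up for illustration and are not part of any tool mentioned above.

```python
import time


def parse_rpc_line(text):
    """Return (calls, retrans) from the 'rpc' line of /proc/net/rpc/nfs-style text.

    Assumes the line looks like: 'rpc <calls> <retrans> <authrefrsh>'.
    """
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "rpc":
            return int(fields[1]), int(fields[2])
    raise ValueError("no 'rpc' line found")


def retrans_fraction(before, after):
    """Fraction of RPC calls between two snapshots that were retransmissions."""
    calls0, retrans0 = parse_rpc_line(before)
    calls1, retrans1 = parse_rpc_line(after)
    calls = calls1 - calls0
    retrans = retrans1 - retrans0
    # No new calls in the interval means nothing to measure.
    return retrans / calls if calls > 0 else 0.0


def sample(path="/proc/net/rpc/nfs", interval=10.0):
    """Take two snapshots `interval` seconds apart and return the fraction."""
    with open(path) as f:
        before = f.read()
    time.sleep(interval)
    with open(path) as f:
        after = f.read()
    return retrans_fraction(before, after)
```

Calling sample() on a compute node while a job is loading its database, once early in the week and once after the slowdown appears, gives two numbers to compare. A fraction that climbs as load rises would be consistent with the retransmission pile-up described above, though it does not by itself say whether the cause is the server, the network, or the hardware.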
Try to isolate the proximate cause of the problem or the precise conditions where it occurs.

HTH,

   rgb

>
> -----Original Message-----
> From: John Hearns [mailto:John.Hearns at cern.ch]
> Sent: Wednesday, December 18, 2002 2:45 AM
> To: Brian LaMere
> Cc: beowulf at beowulf.org
> Subject: Re: memory leak
>
> Brian, please forgive me if I am insulting your intelligence.
>
> But are you sure that you are not just noticing the disk
> buffering behaviour of Linux?
> The Linux kernel will use up spare memory as disk buffers -
> leading an (apparently) lack of free memory.
> This is not really the case - as the memory will be released
> again when needed.
> (Ahem. Was caught out by this too the first time I saw it...)
>
> Anyway, if this isn't the problem, maybe you could send
> us some of the stats from your system?
> Maybe use nfsstat?
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>

Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
