memory leak
Brian LaMere blamere at diversa.com
Wed Dec 18 12:29:59 PST 2002
I've been running the same version of Scyld's distribution since September. I know there has to have been a change somewhere, but that's what I'm having such a hard time tracking down. The scripts handling the queuing system are altered on a regular basis, but I can't for the life of me figure out why a perl script would alter file caching. I, like everyone else <grin>, am aware that the memory isn't really being used. It's purely a buffer thing; it has to be. That, or either malicious code that is confusing the memory manager, or a leak (the only reason I think it might be a leak is its gradual growth).

I have updated absolutely nothing on the cluster since Sept 24, nor have I made a single parameter change. Nothing is different other than the files on the nfs server, and possibly some settings on the nfs server itself (though I can't figure out how any server settings would cause deterioration over time, instead of problems that show up right away). If I had changed anything, it would be really easy for me to point at and figure out, and I wouldn't be frustrated with it.

I have changelogs that I've reviewed. On the cluster, there have been no updates, no installs, no changes. Whenever I install something, it is in one of two ways: either I install from source, or I install from rpm. The last rpm I installed, according to the rpm database, was on September 10. The newest makefile in either of the src directories is from September 9. According to "ls -alrtR|grep Nov" and "ls -altrR|grep Dec" (and again for Oct) run from inside the /etc directory, nothing has changed (sans the things that get changed at reboot, like mtab) since Sept 24.

It really just doesn't make sense to me that a perl script, or even updating a perl binary on an nfs server (and then using it on the cluster), could be causing this problem. The queue system has run nearly flawlessly for a year and a half with this equipment.
If that's possible though, how would I determine why the simple perl script is doing it? The queue system pulls jobs from an outside mysql server...it was updated on the 6th (after this problem started).

I'm just looking for suggestions as to what to do other than reboot the systems once a week. If perl can, in fact, cause these types of problems (how?) then hey...I can work with that. But it truly seems to be a buffer problem. I can throw 500M into ram without it complaining, and without swap ever being hit. It just won't cache these files anymore, until I reboot the systems. Does that make sense?

Vmstat output below. Wiglaf is the master, and it was last rebooted on the 6th. Node 40 and 42 were both rebooted yesterday. Node 42 was taken out of the queue, and is just sitting there 100% idle (control-ish box). Node 40 is currently idle, but is part of the queue. There are no jobs (right now) in the queue. Despite the cache already showing high on 40, its performance is still good...for now. In a week, I'll need to reboot it, along with everyone else.

If *I* knew the answer, I wouldn't be asking the question :) I apologize if I seem frustrated - I am.

[root@wiglaf etc]# vmstat
   procs                        memory    swap          io     system         cpu
 r  b  w   swpd    free   buff   cache  si  so    bi    bo   in    cs  us  sy  id
 0  0  0      0  544580 127004 2533160   0   0     0     2   10     2   1   1  19
[root@wiglaf etc]# bpsh 42 vmstat
   procs                        memory    swap          io     system         cpu
 r  b  w   swpd    free   buff   cache  si  so    bi    bo   in    cs  us  sy  id
 0  0  0      0 2044748    884    5180   0   0     0     0   52     2   0   0 100
[root@wiglaf etc]# bpsh 40 vmstat
   procs                        memory    swap          io     system         cpu
 r  b  w   swpd    free   buff   cache  si  so    bi    bo   in    cs  us  sy  id
 0  0  0      0   19020    884 2024524   0   0     0     0  111    73  29   3  68
[root@wiglaf etc]#

-----Original Message-----
From: Harvey J. Stein [mailto:HJSTEIN at bloomberg.com]
Sent: Wed 12/18/2002 9:13 AM
To: Brian LaMere
Cc: beowulf at beowulf.org
Subject: Re: memory leak

"Brian LaMere" <blamere at diversa.com> writes:

> It was not until a month ago that performance started becoming an issue,
> however. And it was not until yesterday that the cluster almost crippled
> the NFS server.

Given that this started a month ago, I'm going to ask an obvious question, which presumably you've already checked, but you didn't mention it in your messages.

Has any software or hardware on the machines changed in this period? This includes kernels, libs, apps, configs, network cards, etc. on the NFS server, the cluster machines, the routers/hubs, DNS, etc. Has the mix of jobs or the jobs themselves running on the cluster changed?

I'd do "find / -mtime -90" & especially check NFS configs.

--
Harvey Stein
Bloomberg LP
hjstein at bloomberg.com
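
For anyone who wants the result of the "find / -mtime -90" check in a form that can be sorted or filtered, here is a minimal Python sketch of the same idea. It is an illustration only, not something from the thread: the function name recently_modified and its parameters are made up for this example, and unlike find it reports regular directory entries only (not the directories themselves) and does not reproduce find's day-rounding exactly.

```python
import os
import time


def recently_modified(root, days, now=None):
    """Return paths under `root` modified within the last `days` days, oldest first.

    Hypothetical helper for illustration; roughly comparable to
    `find root -mtime -days`, but it lists files only.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # 86400 seconds per day
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so a symlink is judged by its own mtime, not its target's
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime > cutoff:
                hits.append((mtime, path))
    return [path for mtime, path in sorted(hits)]
```

Calling recently_modified("/etc", 90) on the master, on a node, and on the NFS server, and then comparing the three lists, would be one way to spot a config that changed on only one of them.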
