memory leak
Brian LaMere blamere at diversa.com
Tue Dec 17 18:43:18 PST 2002
Having not gotten very far, I thought I'd ask you all for some advice...

I have a memory leak ... somewhere. It *appears* to be an NFS file caching issue. The nodes pick up 750 MB files from an NFS server and crunch on them. They only work on one file at a time. After about a week, they've used up all the memory on the systems. I say all.../almost/ all. Never all. Swap is never used, and of the 2 GB of RAM on each node, there's always at least 16-40 MB free.

The problem is that the nodes lose the ability to cache those 750 MB files, and then have to go out and fetch the file again for each and every run. Since a job takes only a couple of seconds, having to fetch it each time is terribly inefficient. The master (which has 3.25 GB of RAM) exhibits the same behavior: for whatever reason, almost all of its memory is used up too. Nothing short of (blech) rebooting the systems will clear out the memory. Then, after about a week, they're all full again.

The particular file they're caching right now has been the same since Sep 24, and absolutely nothing non-data related has changed on the cluster (i.e., OS files, modules, settings, whatnot) since the beginning of September. There have been some external changes, but... This problem has only been occurring for about a month.

Top doesn't report anything using a lot of memory (a few 0.2's, a few 0.1's, then 0.0's percentage-wise on the master), and when sorted by memory usage nothing over 1% is listed. "free" doesn't even show an exorbitant amount being used for caching; I'm just led to believe it's caching because the memory never gets 100% used (I can increase the load, even when there is only 16 MB of memory free, and swap won't be touched).

Instead of going through and telling everyone all the things I've tried for the last month (would take a long while), I'd rather just see what sort of suggestions people might have. Where does one find a memory hole? Malicious code is, I suppose, theoretically possible. Unfortunately (damn it) I don't have tripwire up, but I'm not terribly sure that would matter. It just doesn't feel like that's the right direction.

Suggestions? Thoughts? Advice? I'm open to anything. Rebooting the cluster once a week hurts...I have to though, because otherwise it takes down the NFS server (what with the constant requests for 750 MB files). I can't think of anywhere else to look other than where I already have. I've stared at the proc tree so long I'm going crazy.

basic info:
Nodes are dual 1 GHz P3's, with 2 GB RAM and an 18 GB local disk.
Master is dual 1 GHz P3 with 3.25 GB RAM and a mirrored 36 GB disk.
Running Scyld 28cz4rc3
Running NFS version 3
/proc/sys/vm/bdflush (not changed, at default):
40 0 0 0 500 3000 60 0 0

(hopeful) pre-emptive thanks,

Brian LaMere
Diversa Corp
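[Editorial illustration, not part of the original message.] The accounting described above (top shows no large process, "free" shows little cache, yet almost nothing is free) can be made explicit by subtracting the free, buffer, and page-cache figures from total memory. The sketch below is one way to do that, assuming a /proc/meminfo that reports "Name: value kB" lines. The function names parse_meminfo and memory_breakdown are hypothetical, and "other_kb" is only a rough remainder: it lumps together process memory and kernel allocations, so it hints at where to look rather than proving a leak.

```python
def parse_meminfo(text):
    """Parse 'Name:   12345 kB' lines into a dict of integers (kB).

    Lines in any other shape (for example the byte-count summary rows
    some older kernels print first) are skipped.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or len(parts) != 2 or parts[1] != "kB":
            continue
        if not parts[0].isdigit():
            continue
        fields[key.strip()] = int(parts[0])
    return fields


def memory_breakdown(fields):
    """Split total RAM into free, buffers, page cache, and everything else."""
    total = fields["MemTotal"]
    free = fields["MemFree"]
    buffers = fields.get("Buffers", 0)
    cached = fields.get("Cached", 0)
    # Whatever is neither free nor buffer/page cache: process memory plus
    # kernel allocations. Compare it with the summed RSS that top reports.
    other = total - free - buffers - cached
    swap_used = fields.get("SwapTotal", 0) - fields.get("SwapFree", 0)
    return {
        "total_kb": total,
        "free_kb": free,
        "buffers_kb": buffers,
        "cached_kb": cached,
        "other_kb": other,
        "swap_used_kb": swap_used,
    }
```

On a node this could be called as memory_breakdown(parse_meminfo(open("/proc/meminfo").read())) at intervals over the week. If "other_kb" keeps growing while the processes listed by top stay small, the memory is probably being held inside the kernel rather than by any user process.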
More information about the Beowulf mailing list
