Lost cycles due to PBS (was Re: Uptime data/studies/anecdotes)
Chris Black cblack at EraGen.com
Tue Apr 2 12:09:34 PST 2002
- Previous message: Uptime data/studies/anecdotes ... ?
- Next message: Lost cycles due to PBS (was Re: Uptime data/studies/anecdotes)
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Tue, Apr 02, 2002 at 12:46:07PM -0600, Roger L. Smith wrote:
> On Tue, 2 Apr 2002, Richard Walsh wrote:
[stuff deleted]
> PBS is our leading cause of cycle loss. We now run a cron job on the
> headnode that checks every 15 minutes to see if the PBS daemons have died,
> and if so, it automatically restarts them. About 75% of the time that I
> have a node fail to accept jobs, it is because its pbs_mom has died, not
> because there is anything wrong with the node.
>

We used to have the same problem with PBS, especially when many jobs were in
the queue. At that point the PBS master sometimes died as well. Since we
switched to SGE/GridEngine/CODINE I've been MUCH happier. Plus there are lots
of nifty things you can do with the expandability of writing your own load
monitors via shell scripts and such.

The whole point of this post is:
GNQS < PBS < Sun Gridengine
:)

Chris
(who tried two other batch schedulers before settling on SGE)
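
For anyone wanting to try the cron-based watchdog Roger describes, here is a minimal sketch in Python. It is not his script. The daemon list (only pbs_mom is named in the thread), the use of `pgrep -x` to detect a running process, and the restart step (simply running the daemon binary by name, assuming it is on PATH and daemonizes itself) are all assumptions to adapt to your site. It also only checks the machine it runs on; checking pbs_mom on compute nodes from the head node would need a remote step that is not shown.

```python
import subprocess
from typing import Callable, Iterable, List

# Assumed daemon names. pbs_mom comes from the thread; the other two are
# typical of PBS installs and may not match yours.
DAEMONS = ("pbs_server", "pbs_sched", "pbs_mom")


def is_running(name: str) -> bool:
    """Return True if a process with exactly this name exists (pgrep -x)."""
    result = subprocess.run(
        ["pgrep", "-x", name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def restart(name: str) -> None:
    """Hypothetical restart: run the daemon binary by name.

    Assumes the binary is on PATH and puts itself in the background.
    Replace with your site's init script if you have one.
    """
    subprocess.run([name], check=False)


def check_and_restart(
    daemons: Iterable[str],
    running: Callable[[str], bool] = is_running,
    start: Callable[[str], None] = restart,
) -> List[str]:
    """Restart every daemon that is not running; return the names restarted."""
    restarted = []
    for name in daemons:
        if not running(name):
            start(name)
            restarted.append(name)
    return restarted


def main() -> None:
    # Print what was restarted so cron mails a record of each incident.
    for name in check_and_restart(DAEMONS):
        print(f"restarted {name}")
```

To match the 15-minute interval mentioned above, a small wrapper that calls `main()` could be run from a crontab entry with a `*/15 * * * *` schedule. The check and restart steps are passed in as functions so the loop can be exercised without touching real daemons.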
More information about the Beowulf mailing list
