Lost cycles due to PBS (was Re: Uptime data/studies/anecdotes)

Thu Apr 11 00:53:52 PDT 2002

--- Chris Black <cblack at eragen.com> wrote:
> On Tue, Apr 02, 2002 at 12:46:07PM -0600, Roger L. Smith wrote:
> > On Tue, 2 Apr 2002, Richard Walsh wrote:
> [stuff deleted]
> > PBS is our leading cause of cycle loss.  We now run a cron job on the
> > headnode that checks every 15 minutes to see if the PBS daemons have died,
> > and if so, it automatically restarts them.  About 75% of the time that I
> > have a node fail to accept jobs, it is because its pbs_mom has died, not
> > because there is anything wrong with the node.
> >
>
> We used to have the same problem with PBS, especially when many jobs were
> in the queue. At that point sometimes the pbs master died as well.
> Since we've switched to SGE/GridEngine/CODINE I've been MUCH happier.
> Plus there are lots of nifty things you can do with the expandability of
> writing your own load monitors via shell scripts and such.
> The whole point of this post is:
> GNQS < PBS < Sun Gridengine :)
>
> Chris (who tried two other batch schedulers until settling on SGE)
> 
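
For anyone who wants to try the cron-based workaround Roger describes,
here is a minimal sketch in Python. It is not his script. It assumes
the daemons run on the machine where the check runs, that pgrep is
available, and that the restart commands (placeholders below) match
your own installation.

```python
import subprocess
from typing import Callable, Dict, List, Sequence


def daemon_running(name: str, run: Callable = subprocess.run) -> bool:
    """Return True if a process with exactly this name exists (pgrep -x)."""
    result = run(["pgrep", "-x", name],
                 stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def check_and_restart(daemons: Dict[str, Sequence[str]],
                      run: Callable = subprocess.run) -> List[str]:
    """Restart every daemon that is not running; return the names restarted.

    `daemons` maps a process name to the command that starts it. The
    commands are supplied by the caller because install paths differ
    from site to site.
    """
    restarted = []
    for name, restart_cmd in daemons.items():
        if not daemon_running(name, run):
            run(list(restart_cmd))
            restarted.append(name)
    return restarted
```

To use it, call check_and_restart() with something like
{"pbs_mom": ["/path/to/pbs_mom"]} (the path is a placeholder) from a
small script, and schedule that script in cron with a "*/15 * * * *"
entry to get the 15-minute interval mentioned above.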

I have had a similar experience. I tried PBS: it is
hard to install, and although it does not offer many
scheduling policies, it is still hard to configure.

Then I read the news about SGE, and since it does not
require root access to install/run, I gave it a try. I
did an experiment a few weeks ago, submitting over
30,000 "sleep jobs" to SGE, and it did not die! If
the master host is down, another machine takes over,
so there is no loss of computing power.
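
A stress test along these lines can be sketched in Python as below.
This is not the exact script I used. The file name sleep_job.sh, the
60-second sleep, and the helper names are illustrative assumptions.
The only SGE command it relies on is plain "qsub <script>".

```python
import os
import subprocess
from typing import Callable


def write_sleep_script(directory: str, seconds: int = 60) -> str:
    """Write a tiny shell script that just sleeps; return its path."""
    path = os.path.join(directory, "sleep_job.sh")  # illustrative name
    with open(path, "w") as handle:
        handle.write("#!/bin/sh\nsleep %d\n" % seconds)
    os.chmod(path, 0o755)
    return path


def submit_sleep_jobs(count: int, script_path: str,
                      run: Callable = subprocess.run) -> int:
    """Submit the script `count` times with qsub; return how many failed."""
    failures = 0
    for _ in range(count):
        result = run(["qsub", script_path],
                     stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        if result.returncode != 0:
            failures += 1
    return failures
```

Calling submit_sleep_jobs(30000, write_sleep_script(".")) from a
directory the execution hosts can see would queue the jobs, and a
non-zero return value means some submissions were rejected.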

I think SGE 5.3 is better than anything else available.
I tried commercial DRM systems and other open source
packages, but so far SGE is by far the best.

BTW, Chris, how many nodes are there in your cluster?

-Ron

P.S. I'm working on a port of SGE to FreeBSD; I hope
people find it useful.
