[Beowulf] scheduler recommendations for an HPC cluster

Rahul Nabar rpnabar at gmail.com
Tue Oct 6 12:22:14 PDT 2009


Any strong or weak recommendations for or against particular schedulers?
For a long time we have worked happily with a Torque + Maui system. It
isn't perfect, but it works (and is free!). A chance rarely presents
itself to go for something "newer and better" on an in-production
system, since people hate changes and outages. Now that we are shopping
for a new cluster, I have the opportunity to change if something
better exists.

Any comments? What are other users using out there?  Any horror
stories? Or any super good finds?

I shy away from LSF etc. since those cost a lot of money, especially as
it and similar systems are mostly licensed per server per year, so
the costs do add up. I have been a user on LSF systems for a long
time and I think it is an awesome scheduler, but I have never been at
the admin end of LSF.

One drawback of the Torque+Maui option is that it is not monolithic.
Oftentimes it is hard to know which component to blame for a problem
or, more to the point, which config file to edit to fix it: Torque's
or Maui's. On the other hand, we can't get rid of Maui, since fairshare
policies etc. are important to us and those seem to be in the Maui
domain. (All our jobs are MPI jobs, in case that is relevant. We
haven't been doing checkpointing yet.)

Of course, there is MOAB these days, but I am not sure if that is
worth the money since I have not used it.

I appreciate any comments or words of wisdom you guys might have!

-- 
Rahul


