[Beowulf] Queue Systems

Chris Dagdigian dag at sonsorol.org
Thu Sep 6 09:27:38 PDT 2007

{ Declaration of bias; I run the http://gridengine.info site in my  
spare time ... }

I'm quite familiar with both LSF and SGE, using both products in my  
professional work and helping clients with queue system selection,  
deployment, application integration and training.  I'm less familiar  
with PBS/Torque/etc. having only run those in small virtualized lab  
environments. When I was looking at open source solutions, none of
the PBS variants supported array jobs, so I went with SGE and never
looked back.
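
For anyone who has not used array jobs: one submission fans out into
many numbered tasks, and each task works out its own share of the
work from its index. Here is a minimal sketch of the task side in
Python. It assumes the SGE_TASK_ID and SGE_TASK_LAST environment
variables that Grid Engine sets for array tasks, and a 1..N task
range with step 1. The input file names and the helper functions are
hypothetical, for illustration only.

```python
import os


def env_int(name, default):
    """Read an integer from the environment, falling back to a default."""
    try:
        return int(os.environ.get(name, ""))
    except ValueError:
        # Covers both a missing variable and a non-numeric value.
        return default


def task_slice(items, task_id, task_count):
    """Return the share of items for a 1-based task_id out of task_count."""
    if not 1 <= task_id <= task_count:
        raise ValueError("task_id must be between 1 and task_count")
    # Round-robin split: task 1 takes items 0, n, 2n, ...; task 2 takes
    # items 1, n+1, 2n+1, ...; and so on.
    return items[task_id - 1::task_count]


if __name__ == "__main__":
    # Hypothetical input names, for illustration only.
    inputs = ["input_%03d.dat" % i for i in range(1, 101)]
    # Outside an array job these fall back to a single task.
    task_id = env_int("SGE_TASK_ID", 1)
    task_count = env_int("SGE_TASK_LAST", 1)
    for name in task_slice(inputs, task_id, task_count):
        print("task %d would process %s" % (task_id, name))
```

Submitted through a small wrapper with something like "qsub -t 1-10",
each of the ten tasks would pick up every tenth input file.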

The current state of the art is quite good. For 90% of use cases and  
end-user requirements you really can't go wrong with any of the  
available products.

Everything out there (open source or commercial) is capable of doing  
the standard sort of "policy based resource management on distributed  
systems" that we all care about.

So with all products capable of doing just about everything you would  
need, making an actual product selection comes down to areas other  
than the functionality of the queueing core.

Things like:

- Administrative burden (for instance, if keeping PBS from falling
over requires a full-time employee, the cost of LSF looks far more
attractive ...)
- Cost
- Quality of support
- Quality of technical documentation
- Quality of training / professional services
- Layered products that enhance base functionality

Platform LSF is the gold standard: low administrative burden, great
documentation and support, and resiliency features that competitors
still have a tough time matching, all wrapped up with additional (at
extra cost, of course) layered products that nobody else can really
touch. The downside? Cost, of course. In particular, the current Linux
pricing model punishes you for putting more than 4GB of RAM into a
compute node or using a non-x86/x86_64 architecture -- in both cases
you'll get bounced out of the "cheap" license category and into a far
more expensive one where the cost of the software license is in the
same ballpark as the cost of the server hardware.
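
The pricing rule above is simple enough to sketch, which can help
when sizing a purchase. This is only an illustration of the two
conditions as described here, not Platform's actual price list: the
category names, the function names and the node inventory are all
hypothetical, and no real prices are implied.

```python
from collections import Counter

# Assumptions taken from the description above: the cheap category
# covers x86/x86_64 nodes with at most 4GB of RAM.
X86_ARCHES = {"x86", "x86_64"}
CHEAP_RAM_LIMIT_GB = 4


def license_category(ram_gb, arch):
    """Classify one compute node as 'cheap' or 'expensive'."""
    if arch.lower() in X86_ARCHES and ram_gb <= CHEAP_RAM_LIMIT_GB:
        return "cheap"
    return "expensive"


def count_by_category(nodes):
    """Tally (ram_gb, arch) pairs by license category."""
    return Counter(license_category(ram, arch) for ram, arch in nodes)


if __name__ == "__main__":
    # Hypothetical cluster: (RAM in GB, architecture) per node.
    cluster = [(4, "x86_64"), (4, "x86_64"), (16, "x86_64"), (4, "ppc64")]
    print(dict(count_by_category(cluster)))
```

Running an inventory through something like this before talking to a
sales rep shows how many nodes a RAM upgrade would push into the
expensive category.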

Platform will happily sell you additional layered products that can  
do things like:

- Tight integration with FlexLM license servers; more powerful than
the standard load sensor (SGE) and elim (LSF) methods that people use
"for free"
- Seriously hardcore reporting and analytic tools suitable for the  
largest enterprises
- Tight integration with parallel environments and high speed  
interconnects (plus support for these environments which is non-trivial)
- SLA-aware scheduling
- Multi-cluster aware scheduling
- etc. etc.
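
For comparison, the "for free" load sensor approach mentioned in the
first item looks roughly like the sketch below. It follows the Grid
Engine load sensor convention as I understand it: the sensor waits
for a line on stdin, answers with a begin/end block of
host:resource:value lines, and exits when it reads "quit". The
resource name app_licenses and the free_licenses() probe are
hypothetical; a real sensor would query the license server there and
would need a matching resource configured in the cluster.

```python
import sys


def format_report(values, host="global"):
    """Build one load report block in host:resource:value form."""
    lines = ["begin"]
    for name, value in sorted(values.items()):
        lines.append("%s:%s:%s" % (host, name, value))
    lines.append("end")
    return "\n".join(lines) + "\n"


def free_licenses():
    """Hypothetical probe: a real one would ask the license server."""
    return {"app_licenses": 0}


def run(stdin=sys.stdin, stdout=sys.stdout, probe=free_licenses):
    """Answer each input line with a report until 'quit' arrives."""
    for line in stdin:
        if line.strip() == "quit":
            break
        stdout.write(format_report(probe()))
        # Flush so the report is delivered immediately, not buffered.
        stdout.flush()


if __name__ == "__main__":
    run()
```

The weakness of this style is that it only reports what it saw at the
last polling interval, which is part of why the tighter commercial
integration is more powerful.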

The base version of LSF also ships with a basic reporting module and
a Tomcat-driven web interface that is suitable for users (submit and
monitor jobs) as well as admins (manage queues and hosts). SGE in
particular does not really have anything like this except for ARCo on
the reporting side, and ARCo is no match for even the "free"
reporting module you get with LSF 7.x.

That said, it's been my experience that the vast majority of the
"market" does not need, and likely never will need, the
advanced/enterprise-level add-ons that integrate so cleanly with the
base Platform LSF products.

So this drops me back down into my original argument that just about  
any of the available products will perform well at doing what you  
need. The key advice I have is to understand that everyone is pretty  
good at the basic functions so you'll have to make your selection  
decision based on some of the other criteria I tried to list above.

My general rule of thumb for new projects is to start with the
assumption that I'll be using Grid Engine. Then, once a more formal
understanding of the workflows and customer requirements is achieved,
it may become clear that Platform LSF is a better choice.

For all of 2007, my rough guess is that I've worked on 20+ Grid
Engine systems and deployed LSF just once, for a large enterprise
customer.

My $.02 of course!

Chris (posting from my non-corporate address)

On Sep 6, 2007, at 5:30 AM, andrew holway wrote:

> Hi,
> We are trying to work out the differences between these queue systems.
> Can anyone shed any light? Pros and Cons...
> SGE, Torque (with Maui), PBSPro and LSF
