[Beowulf] Queue Systems
Chris Dagdigian
dag at sonsorol.org
Thu Sep 6 09:27:38 PDT 2007
{ Declaration of bias: I run the http://gridengine.info site in my
spare time ... }
I'm quite familiar with both LSF and SGE: I use both products in my
professional work and help clients with queue system selection,
deployment, application integration and training. I'm less familiar
with PBS/Torque/etc., having only run those in small virtualized lab
environments. At the time I was looking at open source solutions,
none of the PBS variants supported array jobs, so I went with SGE and
never looked back.
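For anyone who has not used array jobs: one submission fans out into many numbered tasks, and each task finds out its own number from the environment. Below is a minimal Python sketch, assuming the scheduler exports the index as SGE_TASK_ID (Grid Engine) or LSB_JOBINDEX (LSF). The function names and the idea of mapping an index onto a list of inputs are made up for illustration, not taken from either product.

```python
import os

def task_index(env=None):
    """Return the 1-based array task index, or None outside an array job."""
    env = os.environ if env is None else env
    for var in ("SGE_TASK_ID", "LSB_JOBINDEX"):
        value = env.get(var, "")
        # Grid Engine reports "undefined" for non-array jobs, so anything
        # that is not a positive integer is treated as "no index".
        if value.isdigit() and int(value) > 0:
            return int(value)
    return None

def input_for_task(inputs, env=None):
    """Pick the input that belongs to this task (task 1 gets inputs[0])."""
    idx = task_index(env)
    if idx is None or idx > len(inputs):
        return None
    return inputs[idx - 1]
```

Under Grid Engine this would typically be driven by something like `qsub -t 1-100 job.sh`, with each of the 100 tasks calling input_for_task() on the same list and getting a different element.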
The current state of the art is quite good. For 90% of use cases and
end-user requirements you really can't go wrong with any of the
available products.
Everything out there (open source or commercial) is capable of doing
the standard sort of "policy based resource management on distributed
systems" that we all care about.
So with all products capable of doing just about everything you would
need, making an actual product selection comes down to areas other
than the functionality of the queueing core.
Things like:
- Administrative burden (if keeping PBS from falling over requires a
full time employee, the cost of LSF looks far more attractive, for
instance ...)
- Cost
- Quality of support
- Quality of technical documentation
- Quality of training / professional services
- Layered products that enhance base functionality
Platform LSF is the gold standard: low administrative burden, great
documentation and support, and resiliency features that competitors
still have a tough time matching. It is all wrapped up with additional
(at extra cost, of course) layered products that nobody else can really
touch. The downside? Cost, of course. In particular, the current Linux
pricing model punishes you for putting more than 4GB of RAM into a
compute node or for using a non-x86/x86_64 architecture. In both cases
you'll get bounced out of the "cheap" license category and into a far
more expensive one where the cost of the software license is in the
same ballpark as the cost of the server hardware.
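To make the pricing rule concrete, here is a small Python sketch that sorts nodes into the two categories as I described them. It only encodes the rule of thumb from this post (x86/x86_64 and no more than 4GB of RAM stays "cheap"); it is not Platform's actual price list, and the function names and tier labels are my own placeholders.

```python
CHEAP_ARCHS = {"x86", "x86_64"}
RAM_LIMIT_GB = 4

def license_tier(arch, ram_gb):
    """Return 'cheap' only for x86/x86_64 nodes with at most 4 GB of RAM."""
    if arch.lower() in CHEAP_ARCHS and ram_gb <= RAM_LIMIT_GB:
        return "cheap"
    return "expensive"

def count_tiers(nodes):
    """Tally how many (arch, ram_gb) nodes land in each category."""
    counts = {"cheap": 0, "expensive": 0}
    for arch, ram_gb in nodes:
        counts[license_tier(arch, ram_gb)] += 1
    return counts
```

Running count_tiers() over a planned hardware list before asking for a quote is a quick way to see how many nodes a RAM upgrade would push into the expensive category.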
Platform will happily sell you additional layered products that can
do things like:
- Tight integration with FlexLM license servers; more powerful than
the standard load sensor (SGE) and elim (LSF) methods that people do
"for free"
- Seriously hardcore reporting and analytic tools suitable for the
largest enterprises
- Tight integration with parallel environments and high speed
interconnects (plus support for these environments which is non-trivial)
- SLA-aware scheduling
- Multi-cluster aware scheduling
- etc. etc.
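For reference, the "for free" license tracking mentioned in the first bullet usually amounts to a small script that reports free license counts to the scheduler. Below is a rough Python sketch of the two output formats as I understand them: a Grid Engine load sensor answers each line on stdin with a begin/end block of host:resource:value lines, and an LSF elim prints a count followed by name/value pairs. The count_free_licenses() function and the "lic_app" resource name are hypothetical placeholders; a real sensor would query the license server there.

```python
import sys

def count_free_licenses():
    """Hypothetical placeholder: a real sensor would ask the license server."""
    return {"lic_app": 3}

def sge_report(values, host="global"):
    """Format one Grid Engine load sensor report block."""
    lines = ["begin"]
    lines += [f"{host}:{name}:{value}" for name, value in values.items()]
    lines.append("end")
    return "\n".join(lines) + "\n"

def elim_report(values):
    """Format one LSF elim line: <count> <name1> <value1> <name2> <value2> ..."""
    fields = [str(len(values))]
    for name, value in values.items():
        fields += [name, str(value)]
    return " ".join(fields) + "\n"

def run_sge_sensor(stdin=sys.stdin, stdout=sys.stdout):
    """Answer each request line with a report until 'quit' arrives."""
    for line in stdin:
        if line.strip() == "quit":
            break
        stdout.write(sge_report(count_free_licenses()))
        stdout.flush()
```

The weakness of this approach is that the scheduler only sees a periodic snapshot, so licenses checked out between polls can still cause a job to fail at startup. That gap is what the tighter commercial integration is meant to close.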
The base version of LSF also ships with a basic reporting module and
a Tomcat-driven web interface that is suitable for users (submit and
monitor jobs) as well as admins (manage queues and hosts). SGE in
particular does not really have anything like this except for ARCo on
the reporting side, and ARCo is no match for even the "free" reporting
module you get with LSF 7.x.
That said though, it's been my experience that a vast majority of the
"market" does not need and will not likely ever need some of the
advanced/enterprise level add-ons that integrate so cleanly with the
base Platform LSF products.
So this drops me back down into my original argument that just about
any of the available products will perform well at doing what you
need. The key advice I have is to understand that everyone is pretty
good at the basic functions so you'll have to make your selection
decision based on some of the other criteria I tried to list above.
My general rule of thumb for new projects is to start with the
assumption that I'll be using Grid Engine. Then, once the work-flows
and customer requirements are more formally understood, it may become
clear that Platform LSF is a better choice.
For all of 2007, my rough guess is that I've worked on 20+ Grid
Engine systems and deployed LSF just once, for a large enterprise
customer.
My $.02 of course!
Regards,
Chris (posting from my non-corporate address)
On Sep 6, 2007, at 5:30 AM, andrew holway wrote:
> Hi,
>
> We are trying to work out the differences between these queue systems.
>
> Can anyone shed any light? Pros and Cons...
>
> SGE, Torque (with Maui), PBSPro and LSF
>