[Beowulf] Queue Systems

Reuti reuti at staff.uni-marburg.de
Thu Sep 6 14:16:47 PDT 2007


On 06.09.2007 at 18:27, Chris Dagdigian wrote:

>
> { Declaration of bias; I run the http://gridengine.info site in my  
> spare time ... }
>
> I'm quite familiar with both LSF and SGE, using both products in my  
> professional work and helping clients with queue system selection,  
> deployment, application integration and training.  I'm less  
> familiar with PBS/Torque/etc. having only run those in small  
> virtualized lab environments. At the time when I was looking at  
> open source solutions, none of the PBS variants supported array  
> jobs so I went with SGE and never looked back.

Another thing is Tight Integration of parallel jobs, which PBS/Torque
offers for LAM/MPI and Open MPI, but not for HP-MPI, Linda or PVM. You
can of course still run those under PBS/Torque, but the slave
processes will not be controlled by the queuing system, nor will you
get correct accounting. SGE offers an rsh replacement called qrsh,
which allows Tight Integration of these parallel environments as well.
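
To illustrate the idea, here is a minimal sketch in Python of what such an
rsh replacement does: it takes the arguments the parallel library would
have passed to rsh and turns them into a "qrsh -inherit" call, so the slave
process is started under SGE's control. The function name rsh_to_qrsh and
the host name node07 are made up for this example, the option handling
(only -n and -l) is a simplifying assumption, and the resulting command
only works from inside a job running in an SGE parallel environment.

```python
def rsh_to_qrsh(rsh_args):
    """Translate rsh-style arguments into a 'qrsh -inherit' argument list.

    rsh_args is everything after the program name, for example
    ["-n", "node07", "hostname"]. This is a sketch, not a full rsh
    parser: it only recognises "-n" and "-l user" and drops both.
    """
    args = list(rsh_args)
    while args and args[0].startswith("-"):
        flag = args.pop(0)
        if flag == "-l":
            # "-l user" names a remote user; the job's own user is kept.
            if not args:
                raise ValueError("-l requires a user name")
            args.pop(0)
        elif flag != "-n":
            raise ValueError("unsupported rsh option: " + flag)
    if len(args) < 2:
        raise ValueError("expected: host command [args...]")
    host, command = args[0], args[1:]
    # qrsh -inherit starts the command on a host already granted to the job.
    return ["qrsh", "-inherit", host] + command


if __name__ == "__main__":
    print(" ".join(rsh_to_qrsh(["-n", "node07", "hostname"])))
```

A real wrapper would then exec the returned command instead of printing it.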

-- Reuti

> The current state of the art is quite good. For 90% of use cases  
> and end-user requirements you really can't go wrong with any of the  
> available products.
>
> Everything out there (open source or commercial) is capable of  
> doing the standard sort of "policy based resource management on  
> distributed systems" that we all care about.
>
> So with all products capable of doing just about everything you  
> would need, making an actual product selection comes down to areas  
> other than the functionality of the queueing core.
>
> Things like:
>
> - Administrative burden (if keeping PBS from falling over requires  
> a full time employee; the cost of LSF looks far more attractive for  
> instance ...)
> - Cost
> - Quality of support
> - Quality of technical documentation
> - Quality of training / professional services
> - Layered products that enhance base functionality
>
> Platform LSF is the gold standard. Low administrative burden, great  
> documentation/support and resiliency features that competitors  
> still have a tough time matching and all wrapped up with additional  
> (at extra cost of course) layered products that nobody else can  
> really touch. The downside? Cost of course. In particular the  
> current Linux pricing model punishes you for putting more than 4GB  
> of RAM into a compute node or using a non X86/X86_64 architecture  
> -- in both cases you'll get bounced out of the "cheap" license  
> category and into a far more expensive one where the cost of the  
> software license is in the same ballpark as the cost of the server  
> hardware.
>
> Platform will happily sell you additional layered products that can  
> do things like:
>
> - Tight integration with FlexLM license servers; more powerful than  
> the standard load sensor (SGE) and elim (LSF) methods that people  
> do "for free"
> - Seriously hardcore reporting and analytic tools suitable for the  
> largest enterprises
> - Tight integration with parallel environments and high speed  
> interconnects (plus support for these environments which is non- 
> trivial)
> - SLA-aware scheduling
> - Multi-cluster aware scheduling
> - etc. etc.
>
> The base version of LSF also ships with a basic reporting module  
> and a tomcat-driven web interface that is suitable for users  
> (submit and monitor jobs) as well as admins (manage queues and  
> hosts). SGE does not really have anything like this, except for
> ARCo on the reporting side, and ARCo is no match for even the
> "free" reporting module you get with LSF 7.x.
>
> That said though, it's been my experience that a vast majority of  
> the "market" does not need and will not likely ever need some of  
> the advanced/enterprise level add-ons that integrate so cleanly  
> with the base Platform LSF products.
>
> So this drops me back down into my original argument that just  
> about any of the available products will perform well at doing what  
> you need. The key advice I have is to understand that everyone is  
> pretty good at the basic functions so you'll have to make your  
> selection decision based on some of the other criteria I tried to  
> list above.
>
>
> My general rule of thumb for new projects is to start with the
> assumption that I'll be using Grid Engine. Then, once a more formal
> understanding of the workflows and customer requirements has been
> reached, it may become clear that Platform LSF is a better choice.
>
> For all of 2007, I'd guess that I've worked on 20+ Grid Engine
> systems and deployed LSF just once, for a large enterprise
> customer.
>
>
> My $.02 of course!
>
> Regards,
> Chris (posting from my non-corporate address)
>
> On Sep 6, 2007, at 5:30 AM, andrew holway wrote:
>
>> Hi,
>>
>> We are trying to work out the differences between these queue  
>> systems.
>>
>> Can anyone shed any light? Pros and Cons...
>>
>> SGE, Torque (with Maui), PBSPro and LSF
>>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit  
> http://www.beowulf.org/mailman/listinfo/beowulf



