[Beowulf] Please help to setup Beowulf
Reuti
reuti at staff.uni-marburg.de
Tue Feb 17 12:47:53 PST 2009
Hi, to throw in one more:
On 17.02.2009, at 20:51, Chris Dagdigian wrote:
> On Feb 17, 2009, at 2:29 PM, Michael Will wrote:
>
>> What features differentiate SGE in support of life science workflow
>> from LSF/PBS/Torque/Condor?
Is anyone using IBM's LoadLeveler for Linux, and why? I saw some
time ago that it's available, but I got the impression that it
mostly addresses sites which already run LoadLeveler on AIX and
don't want to introduce a second queuing system into their
infrastructure.
> They all have their pros and cons; heck, I'm still an LSF zealot
> when cost is not an issue, as Platform has the best APIs,
> documentation and layered products for the industry types who need
> to stand these things up in full production mode within enterprise
> organizations that may have varying levels of Linux/HPC/MPI
> experience.
>
> The short list of why Grid Engine became popular in the life sciences:
>
> LSF: great product, but commercial-only and with a pricing model
> that can get out of hand (I remember when having more than 4GB RAM
> in a Linux 1U pushed me into an obscene license tier ...).
>
> Condor: Did not have the fine-grained policy and resource
> allocation tools that make life easier when you need to have a
> shared cluster resource supporting multiple competing users,
> groups, projects and workflows. The policy tools for LSF/SGE/PBS
> were more capable. When I saw Condor out in the field, it seemed
> to be used mostly at academic sites and in situations where cycles
> from PC systems were being aggregated across LAN-, metro- and
> WAN-scale distances. Bio problems tend to be more I/O or memory
> bound than CPU bound, so most bio clusters tend to be closely
> situated racks of gear.
>
> PBS/TORQUE: I'll ignore the FUD from back in the day when people
> were claiming that PBS lost jobs and data at high scale and
> concentrate on just one key differentiator. At the time when life
> science was transitioning from big SGI Altix and Tru64 AlphaServer
> machines to commodity compute farms, PBS did not support the
> concept of array jobs. If there was one overwhelming cluster
> resource management feature essential for bio work, it would be
> array tasks. This is because we tend to have a very high
> concentration of batch/serial workflows that involve running an
> application many, many times in a row with varying input files and
> parameter options. The cliche example in bioinformatics is needing
> to run half a million BLAST searches. Without array task
> scheduling this would require 500,000 individual job submissions.
> The fact that I never met a serious PBS shop that had not made
> local custom changes to the source code also soured me on
> deploying it when I was putting such things into conservative IT
> shops that were still new to and fearful of Linux.
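(To make the array-task point concrete: a minimal SGE array job for
that kind of workload could look roughly like the sketch below. The
wrapper script name, the queries/ and results/ layout and the legacy
blastall call are made up purely for illustration; the #$ directives,
the -t option and SGE_TASK_ID are the Grid Engine parts.

   #!/bin/sh
   # blast_array.sh : submit once with  qsub -t 1-500000 blast_array.sh
   #$ -N blast_array
   #$ -cwd
   # SGE sets SGE_TASK_ID to this task's index in the array
   # (1..500000), so every task picks up its own query file.
   INPUT=queries/query_${SGE_TASK_ID}.fa
   blastall -p blastp -d nr -i ${INPUT} -o results/out_${SGE_TASK_ID}.txt

One qsub call queues all 500,000 tasks, subject to any per-cluster
limit on array size, and the scheduler dispatches them as slots
become free, instead of 500,000 individual job submissions.)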
One more thing: AFAIK Torque has no built-in scheduler besides the
FIFO one. You will need MAUI (free) or MOAB (commercial) to get a
real scheduler, with the side effect that you have to use "qstat"
(for Torque) and "showq" (for MAUI) to check the status of your jobs.
-- Reuti
> We also don't make heavy use of Globus-style, WAN-scale, capital
> "G" Grid computing, as most of our workflows and pipelines are
> actually performance bound by the speed of storage rather than by
> CPU or memory. It was always easier, cheaper and more secure to
> colocate dedicated CPU resources close to fast storage than to
> distribute things out as far as possible.
>
> The big news in Bio-IT these days is actually the terabyte-scale
> wet lab instruments such as confocal microscopes and next-gen DNA
> sequencing systems that can produce 1-3TB of raw data per
> experiment. Some of these lab instruments ship with software
> pipelines developed to run under Grid Engine. A popular example is
> the Solexa/Illumina Genome Analyzer, which alone has driven SGE
> uptake in our field. A notable exception is the SOLiD system,
> which (I think) ships with a Windows front end that hides a
> back-end ROCKS cluster running either PBS or Torque under the hood.
>
>
> And from Mark:
>
>> how about providing some useful content - for instance, what is it
>> that you think is especially valuable about sge?
>
> Hopefully I've done some of that with this message. It basically
> boils down to the fact that at the time our field started using
> compute farms in a serious manner, SGE offered the best overall
> combination of features, price and fine grained resource allocation
> & policy control. I think what made us a bit different from some
> other use cases is our heavy use of serial/batch workflows combined
> with our tendency to require that our HPC infrastructures support
> multiple (and potentially competing) workflows and pipelines, which
> made the policy/allocation features a key selection criterion. We
> also do little if any true WAN-scale "grid" computing due to
> workflows that tend to be more storage/IO bound than anything
> else. For people starting fresh with a cluster scheduling layer
> who did not have an investment in time, expertise and/or software
> licensing costs, Grid Engine turned out to be a popular choice.
> With that popularity came a good set of people in the community who
> can now support and configure these systems (as well as evangelize
> them), so the cycle is fairly self-perpetuating.
>
>
> General life science cluster cheat sheet:
>
> - Workloads tend to be far more serial/batch in nature than true
> parallel
> - Policy and resource allocation features are very important to
> people deploying these systems
> - Storage speed is often more important than network speed or
> latency in many cases
> - Fast interconnects are often used for cluster/distributed
> filesystems rather than application message passing
> - Our MPI codes are often quite horrific from an efficiency/tuning
> standpoint - gigE works just as well as Myrinet or IB
> - Exceptions to the MPI rule: computational chemistry, modeling and
> structure prediction (those fields have well written commercial MPI
> codes in use)
> - Huge resistance to improved algorithms as scientists want to use
> *exactly* the same code that was used to publish the journal paper
>
>
> -Chris
>
>
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf