BLAST or wu-blast for beowulf?

Thu Apr 19 17:44:46 PDT 2001

Hi folks

Sorry that I haven't found time to answer this before, been very busy
setting up our new company here in Toronto.

The short answer is that so far the only clustered implementations
available are commercial (www.computefarm.com or www.sgi.com).
Otherwise you can run BLAST under PBS and script it yourself, or set up
the web-based CGIs for BLAST and run them behind a load balancer.

The long "archive-quality" answer follows...

BLAST and WU-BLAST are bioinformatics applications that compare protein
or DNA sequences against databases of DNA or protein sequences to find
similarities.  The original programs are highly optimized for
multiprocessor machines like the Sun and SGI boxes on which they were
originally developed.

The BLAST executables (original, non-clustered versions) are at
ftp://ncbi.nlm.nih.gov/blast 
and WU-BLAST is at
http://blast.wustl.edu/

When we refer to BLAST jobs, we call each one a "query": one sequence
being compared to one database.

There are several issues with running BLAST on a cluster, and the
answer depends on what you want clustered BLAST to do!  The objectives
vary quite a bit, and they require different implementations.

Here are some examples of what your objectives might be:
1) I want to run a lot of BLAST queries in batches.
2) I want more speed on a single BLAST query.
3) I have a BIG DNA database to search through.
4) I want to set up a web-interface BLAST service on a cluster for
users.

In all cases, the implementation also needs scripts to do the daily
updating of databases stored on the local node hard disks.  Figure on
doing some work here; Perl helps.
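
As a rough sketch of the decision part of such an update script (in
Python rather than Perl, purely for illustration): given a listing of
the database files on the master copy and on a node, work out which
files need to be fetched again.  The (size, mtime) layout and the file
names are assumptions of mine, not anything BLAST or NCBI defines, and
the actual FTP transfer is left out.

```python
from typing import Dict, List, Tuple

# (size_in_bytes, modification_time) per database file.
# This layout is a hypothetical choice made for the sketch.
FileInfo = Tuple[int, float]


def files_to_update(remote: Dict[str, FileInfo],
                    local: Dict[str, FileInfo]) -> List[str]:
    """Return the remote files that are missing locally, differ in
    size, or have a newer modification time than the local copy."""
    stale = []
    for name, (r_size, r_mtime) in sorted(remote.items()):
        if name not in local:
            stale.append(name)          # never downloaded to this node
            continue
        l_size, l_mtime = local[name]
        if r_size != l_size or r_mtime > l_mtime:
            stale.append(name)          # changed since the last update
    return stale


if __name__ == "__main__":
    remote = {"nt.00": (100, 20.0), "nr": (50, 10.0), "est": (5, 1.0)}
    local = {"nt.00": (100, 20.0), "nr": (50, 5.0)}
    print(files_to_update(remote, local))
```

A cron job on each node could run something like this against a
listing from the master copy and then fetch only the stale files.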

I address these situations:

1)  I want to run a lot of BLAST queries.

Then you want a compute farm approach.  Many people use load sharing
software like LSF or PBS to execute BLAST on compute farms.  You will
also need scripts that download the databases by FTP and update them on
all the nodes as a regular process or a cron job.
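
To illustrate the compute farm idea, here is a minimal sketch that
splits a multi-sequence FASTA file of queries into one batch per node,
so that each batch can be submitted as a separate job through whatever
scheduler you use.  The function names are made up for this example,
and the job submission itself is deliberately left out.

```python
from typing import List


def split_fasta(text: str) -> List[str]:
    """Split multi-FASTA text into one record per sequence
    (the '>' header line plus its residue lines)."""
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(">") and current:
            records.append("\n".join(current) + "\n")
            current = []
        if line.strip():                # skip blank lines
            current.append(line)
    if current:
        records.append("\n".join(current) + "\n")
    return records


def batch_round_robin(records: List[str], n_nodes: int) -> List[List[str]]:
    """Deal the records out into n_nodes batches, one batch per node."""
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    batches = [[] for _ in range(n_nodes)]
    for i, rec in enumerate(records):
        batches[i % n_nodes].append(rec)
    return batches


if __name__ == "__main__":
    queries = ">a\nACGT\n>b\nGG\nTT\n>c\nA\n"
    for node, batch in enumerate(batch_round_robin(split_fasta(queries), 2)):
        print(node, [rec.splitlines()[0] for rec in batch])
```

Round-robin dealing ignores sequence length; if your queries vary a lot
in size, balancing the batches by total residues would be a better fit.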

2)  I want more speed on a single BLAST query.  

BLAST becomes I/O bound very quickly on an SMP machine, and doesn't
really scale that well on a cluster for a single query.  It is already
multithreaded. Amdahl's law gets you very quickly in BLAST if you try
interprocess communication as a model for speeding it up, so forget it.  
So if you want speed, add memory, faster CPUs or more of them, or chunk
the database into pieces (see 3 below).  I suggest running multithreaded
BLAST on dual CPU nodes with sufficient disk and memory to store the
databases.  Remember to use the processor number argument too; BLAST
needs to be told how many CPUs to run on.
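
To put rough numbers on the Amdahl's law point, and to show where the
processor number argument goes, here is a small sketch.  The parallel
fractions in the example are illustrative guesses, not measurements of
BLAST.  The command line assumes the NCBI blastall program, where as
far as I know -a is the number of processors; check the README for the
version you have.  The database and file names are placeholders.

```python
from typing import List


def amdahl_speedup(parallel_fraction: float, n_workers: int) -> float:
    """Amdahl's law: speedup = 1 / ((1 - p) + p / n), where p is the
    fraction of the run time that can be parallelized."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_workers)


def blastall_command(program: str, database: str, query: str,
                     output: str, n_cpus: int) -> List[str]:
    """Build an argument list for NCBI blastall (assumed flags:
    -p program, -d database, -i query, -o output, -a processors)."""
    return ["blastall", "-p", program, "-d", database,
            "-i", query, "-o", output, "-a", str(n_cpus)]


if __name__ == "__main__":
    # Even if 90% of the work were parallel (an illustrative figure),
    # 16 workers would give well under a 16x speedup.
    print(round(amdahl_speedup(0.9, 16), 2))
    print(" ".join(blastall_command("blastn", "nt", "query.fa",
                                    "query.out", 2)))
```

The list returned by blastall_command can be handed to subprocess.run
on a node that has BLAST installed.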

3) I have a BIG DNA database to search through and must partition it.

People who use BLAST on protein databases have smaller memory
requirements than those using BLAST on DNA databases.  The DNA databases
are much larger, and in commercial companies can be up to several tens
of gigabytes.  Companies often set up SMP machines with lots of RAM as
BLAST servers, and they are typically not Linux boxes.

Databases that don't fit in memory often cause the computers to thrash,
especially on multiprocessor machines.  For example, a dual CPU node
with 128 MB of RAM and two processes running will thrash horribly on a
large DNA database as each process competes to load the database chunk
it is working on into the same block of memory.
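
One way to avoid the thrashing is to plan the database pieces so that
each one fits within a node's memory.  A minimal sketch, assuming you
already have a list of (sequence name, size in bytes) pairs and a
per-node memory budget you have picked yourself; both are hypothetical
inputs for this example, and the step that actually writes and formats
the pieces is not shown.

```python
from typing import List, Tuple


def plan_chunks(seq_sizes: List[Tuple[str, int]],
                budget_bytes: int) -> List[List[str]]:
    """Greedily pack sequences, in their original order, into chunks
    whose total size does not exceed budget_bytes."""
    chunks: List[List[str]] = []
    current: List[str] = []
    used = 0
    for name, size in seq_sizes:
        if size > budget_bytes:
            raise ValueError(f"{name} alone exceeds the memory budget")
        if current and used + size > budget_bytes:
            chunks.append(current)      # close the full chunk
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sizes = [("a", 40), ("b", 40), ("c", 30), ("d", 60)]
    print(plan_chunks(sizes, 100))
```

Leave headroom in the budget for the operating system and for BLAST's
own working memory rather than using the full RAM size.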

BLAST uses memory-mapped I/O, so that multiple instances can use the
same data in memory, and it works best when the whole database fits in
memory and multiple processes can have at it.  
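
For anyone who has not used memory-mapped I/O, here is a small sketch
of the idea using Python's mmap module.  It is not BLAST's code, just
an illustration: the file is mapped read-only and scanned in place, and
when several processes map the same file the operating system can serve
them all from the same cached pages.

```python
import mmap
import os
import tempfile


def count_byte_mmap(path: str, byte: bytes) -> int:
    """Count occurrences of a single byte in a file through a
    read-only memory map (the file must not be empty)."""
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as view:
            count = 0
            pos = view.find(byte)
            while pos != -1:
                count += 1
                pos = view.find(byte, pos + 1)
            return count


if __name__ == "__main__":
    # Write a tiny FASTA-like file and count its '>' header markers.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(b">s1\nACGT\n>s2\nGG\n")
        print(count_byte_mmap(path, b">"))
    finally:
        os.remove(path)
```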

Blackstone Computing (www.computefarm.com) makes a commercial clustered
version of BLAST that apparently works by extending memory-mapped I/O
across the cluster.  When it needs a piece of a database, it seems to
broadcast a request for that file to the cluster and, if another node
already has a copy in memory, grab it from that node through a socket.
So it does a memory-to-memory transfer rather than a disk-to-memory
transfer.  I have not tried this implementation.  It may also require
some heavy scripting to break up the BLAST databases to match your
cluster node size and memory.  Figure on doing this on a daily update
cycle.

Anyhow, the memory-mapped I/O trick is an interesting one that could be
implemented at the Linux kernel level somewhere, I think, as a
general-purpose cluster utility.

4) I want to set up a web-interface BLAST server.

This is a common desire, but is not really a cluster issue.  A good
single CPU machine can do this nicely for a few casual users; again, put
enough RAM in it.
Look here for precompiled executables for the CGI versions of BLAST:
ftp://ncbi.nlm.nih.gov/blast/server/
They have executable CGIs for Linux, Tru64, SGI and Solaris.
If you set these up with a load balancer on several nodes, you may have
what you are looking for to serve more users.

Christopher Hogue, Ph.D.
CIO MDS Proteomics 
http://www.mdsproteomics.com

On leave from the Samuel Lunenfeld Research Inst. Mt. Sinai, Toronto.
http://bioinfo.mshri.on.ca

gregory j pryzby wrote:
> 
> I am looking for infromation (w/o much success) to see if there is a
> version of BLAST that will run on a beowulf cluster.
> 
> --
> greg pryzby                      greg at pryzby dot org
> ach tee tee pee colon slash slash pryzby dot org slash
> fingerprint: 8A1A DB90 869F 5DD1 D6E9 EEB6 C156 6B04 849F A86F