Parallel BLAST

Sun Apr 14 19:32:20 PDT 2002

> Why is it that BLAST is not available for MPI/PVM?  I would think
> clusters would be the perfect host for such an application.
> Is it that there is no need because BLAST is already so fast and
> no one wants to break the database out onto node-resident disks?
> Or is it that BLAST is kept running on single-processor or shared-memory
> machines so that the DB is always in memory, ready to roll without
> loading, and doing the same for a cluster is not worth it
> because the same trick is difficult to do on a node given the current
> way clusters are built?  I assume the same is true for FASTA?

I suspect that BLAST is not available for MPI/PVM because (1) it is
too fast, and (2) there is not much demand for it.  

95% of the time, BLAST is almost an in-memory grep (the other 5% of
the time it is working on the matches it has found).  Sequence
comparison is embarrassingly parallel, and very easily threaded.
Distributing the sequence databases across nodes and collecting the
results has more overhead than threading on a shared-memory machine
(there probably aren't many distributed grep programs either).  FASTA
is 5-10X slower than BLAST, and Smith-Waterman is another 5-20X slower
than FASTA.  For these slower programs the communications overhead is
a smaller fraction of the run time, so distributed systems work OK for
FASTA, and great for Smith-Waterman (where the overhead fraction is
very small).
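
To put rough numbers on the overhead argument, here is a minimal
sketch in Python.  It uses only the speed ratios quoted above; the
communication cost (comm) is a made-up value, assumed fixed and set
equal to one BLAST search, so the fractions are illustrative rather
than measured.

```python
# Relative compute time per search, from the ratios quoted above:
# BLAST = 1, FASTA = 5-10X BLAST, Smith-Waterman = 5-20X FASTA.
RELATIVE_COMPUTE = {
    "BLAST": (1.0, 1.0),
    "FASTA": (5.0, 10.0),
    "Smith-Waterman": (25.0, 200.0),
}


def overhead_fraction(compute, comm):
    """Fraction of total run time spent on communication."""
    return comm / (comm + compute)


def overhead_table(comm=1.0):
    """Return {program: (lowest, highest) overhead fraction}.

    comm is a HYPOTHETICAL fixed cost for distributing the database
    and collecting results, in units of one BLAST search.
    """
    table = {}
    for name, (fast, slow) in RELATIVE_COMPUTE.items():
        # The slowest compute time gives the lowest overhead fraction.
        table[name] = (overhead_fraction(slow, comm),
                       overhead_fraction(fast, comm))
    return table


if __name__ == "__main__":
    for name, (low, high) in overhead_table().items():
        print(f"{name:15s} overhead {low:6.1%} to {high:6.1%}")
```

Under that assumption, communication is half the run time for BLAST
but only a few percent or less for Smith-Waterman, which is the
pattern described above.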

Of course, it is a lot easier to compile a threaded program, and just
run it, than it is to install and configure the MPI or PVM environment
and the programs to run in it.  Bioinformatics software is often run
by computer-savvy biologists, not high-performance computing folks,
and not having to install and configure PVM/MPI is a big advantage.
The NCBI probably does not make a PVM/MPI parallel BLAST because there
is very little demand for it, and it does not meet their computational
needs.
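
For comparison, the threaded approach needs nothing beyond the
standard library.  The sketch below is a toy, not BLAST's algorithm:
shared_kmers is a hypothetical stand-in score (distinct k-mers in
common) and the database is a small in-memory dict.  It only shows
how little machinery is involved when each comparison is independent.

```python
from concurrent.futures import ThreadPoolExecutor


def shared_kmers(query, subject, k=3):
    """Toy similarity score: distinct k-mers the two sequences share."""
    q = {query[i:i + k] for i in range(len(query) - k + 1)}
    s = {subject[i:i + k] for i in range(len(subject) - k + 1)}
    return len(q & s)


def search(query, database, workers=4):
    """Score the query against every entry; return hits, best first."""
    names = list(database)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each comparison is independent, so the work can simply be
        # handed out to the pool with no coordination between tasks.
        scores = list(pool.map(
            lambda name: shared_kmers(query, database[name]), names))
    return sorted(zip(names, scores), key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    db = {"a": "ACGTACGT", "b": "TTTTTTTT", "c": "ACGTTTTT"}
    for name, score in search("ACGTACGT", db):
        print(name, score)
```

In CPython, threads will not speed up pure-Python scoring because of
the global interpreter lock; the point here is the structure, which
maps directly onto threads in a compiled program.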

Bill Pearson