[Beowulf] Maker2 genomic software license experience?
Joe Landman
landman at scalableinformatics.com
Fri Nov 9 04:59:44 PST 2012
On 11/09/2012 06:20 AM, Igor Kozin wrote:
> You nailed it! And not just the code, new codes appear all the time.
> bowtie, bwa, soap2, soap3, bowtie2, snap ..
This was one of the major issues we found when we were pitching
accelerators to VCs in the early 2000s. There were, at the time,
something on the order of 200 phylogenetic codes. Many alignment codes.
Many of code type X.
Seemed that every lab had/used something different, so an accelerator
had to be able to work on a generic set of problems, without complex
rewriting of code.
Moreover, and this might be my biases showing, the code quality wasn't
... erm ... high. Bad design patterns were in use, if any. We ran into
one proteomic code whose authors/users claimed they had a great
threading model, only to look in abject horror at object factories deep
in nested loops.
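To make the factory-in-a-loop problem concrete, here is a minimal sketch. It is not the proteomic code in question; `Scorer`, `make_scorer`, and both scoring functions are hypothetical names invented for illustration. The slow version builds a fresh object for every residue pair, the hoisted version builds it once outside the loops.

```python
from typing import List


class Scorer:
    """Hypothetical scoring object; counts how many times it is constructed."""

    instances_created = 0

    def __init__(self, match: int = 1, mismatch: int = -1) -> None:
        Scorer.instances_created += 1
        self.match = match
        self.mismatch = mismatch

    def score(self, a: str, b: str) -> int:
        return self.match if a == b else self.mismatch


def make_scorer() -> Scorer:
    """Hypothetical factory, standing in for the 'object factory' pattern."""
    return Scorer()


def score_pairs_slow(seqs_a: List[str], seqs_b: List[str]) -> int:
    """Anti-pattern: the factory is called in the innermost loop."""
    total = 0
    for a in seqs_a:
        for b in seqs_b:
            for x, y in zip(a, b):
                scorer = make_scorer()  # one allocation per residue pair
                total += scorer.score(x, y)
    return total


def score_pairs_hoisted(seqs_a: List[str], seqs_b: List[str]) -> int:
    """Same result, but the object is built once, outside the loops."""
    scorer = make_scorer()
    total = 0
    for a in seqs_a:
        for b in seqs_b:
            for x, y in zip(a, b):
                total += scorer.score(x, y)
    return total
```

Both functions return the same total; only the number of constructions differs. In threaded code the per-iteration allocation can also become a point of contention, though how much depends on the allocator and runtime in use.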
It was rare, with pretty much the exception of the HMMer code by Sean
Eddy, that people were focused on performance with good
coding/design/implementation. We did some pretty simple recoding of
elements of this, and part of our larger group did some MPI and GPU
work. Many of the concepts were folded into HMMer 3 (current version,
should be the one people use). But this was done as he needed the tool
to be faster, and he paid attention to the issues associated with this.
This had not been true of (most of) the rest the last time I looked.
This is important, as one of the things that would tremendously benefit
this community is a set of libraries of routines that can be reviewed,
collected, and used much like BLAS. Then, rather than writing your own
implementation of a particular algorithm, you leverage the basic working
plumbing of these libraries to build your code (similar to LAPACK, etc.).
Unfortunately, I see a rather significant "not invented here" viewpoint
in some groups, and it's not lacking here. Which means, in many cases,
there are huge efforts expended to re-invent core algorithms. Less
effort to build tools atop another set of tools.
As many have noted, the parallelism is expressed through a scheduler. I
remember calling this style of computing: high throughput. That is,
more parallelism by wider distribution of computation, not maximization
of single run performance across a larger machine. There's nothing
intrinsically wrong with this, it's just a different way to do things. It
also breaks critical assumptions in many HPC areas. I remember, having
written ctblastall in the 1999-2000 time frame to seamlessly distribute
blast computations across a cluster, that we started running into issues
with job schedulers. ctblastall would, in some of the larger cases,
divide up the large blast run into tens of thousands of smaller jobs,
and submit them to a cluster. Job schedulers, back then, for the most
part, couldn't handle it. Platform's LSF could. Got a bit sluggish,
but it worked. I included a number of optimizations to try to keep the
throughput high, including early submission of bigger jobs, among other
things like data motion optimization.
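The splitting and "bigger jobs first" ordering can be sketched roughly as follows. This is not the ctblastall implementation; the function names, the `max_residues` parameter, and the use of total residue count as a stand-in for job cost are all assumptions made for illustration. Actual submission to a scheduler is left out.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header, sequence)


def parse_fasta(text: str) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    seq: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_into_jobs(records: Iterable[Record], max_residues: int) -> List[List[Record]]:
    """Greedily pack consecutive records into jobs of at most max_residues.

    A single record longer than max_residues still gets a job of its own.
    """
    jobs: List[List[Record]] = []
    current: List[Record] = []
    size = 0
    for rec in records:
        n = len(rec[1])
        if current and size + n > max_residues:
            jobs.append(current)
            current, size = [], 0
        current.append(rec)
        size += n
    if current:
        jobs.append(current)
    return jobs


def job_size(job: List[Record]) -> int:
    """Assumed cost proxy: total residues in the job."""
    return sum(len(seq) for _, seq in job)


def submission_order(jobs: List[List[Record]]) -> List[List[Record]]:
    """Largest jobs first, so the long runners start early."""
    return sorted(jobs, key=job_size, reverse=True)
```

Submitting the largest pieces first is the usual longest-job-first heuristic: the long runners overlap with the many short ones instead of leaving a straggler at the end of the run.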
Don't disparage this as not being high performance ... it is. It's just
expressed/used differently.
As to the point of code being written in Python/R/... Mebbe that's not
such a good idea (Python). R, Matlab, Octave,... are interpreted as
well. Langs compiled to a VM are ok (Java, Perl6, Julia), but best
performance is going to come from statically compiled code. This said,
the Julia people are doing their absolute best to be on par with
statically compiled code, and it's coming very close.
But in my mind, it's still a huge missed opportunity to not be using
something akin to BLAS for bioinfo codes. And somewhat worse, these
groups often are ... er ... influenced by computer science fads ... and
you see that in their code. Some of these are hard to unwind into good
code. The object factory design pattern is a great example of this. I
am not sure why they do this, other than I see lots of academic
collaboration between bioinfo folks and comp sci folks, and the comp sci
folks need to publish papers, not write good code.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman at scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615