[Beowulf] Maker2 genomic software license experience?

Fri Nov 9 04:59:44 PST 2012

On 11/09/2012 06:20 AM, Igor Kozin wrote:
> You nailed it! And not just the code, new codes appear all the time.
> bowtie, bwa, soap2, soap3, bowtie2, snap ..

This was one of the major issues we found when we were pitching
accelerators to VCs in the early 2000s.  There were, at the time,
something on the order of 200 phylogenetic codes.  Many alignment codes.
Many of code type X.

Seemed that every lab had/used something different, so an accelerator 
had to be able to work on a generic set of problems, without complex 
rewriting of code.

Moreover, and this might be my biases showing, the code quality wasn't
... erm ... high.  Bad design patterns were in use, if any.  We ran into
one proteomic code whose authors/users claimed they had a great
threading model, only for us to look in abject horror at object
factories deep in nested loops.
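
To make the object-factory complaint concrete, here is a minimal sketch
in Python.  Every name in it (Scorer, make_scorer, the two
total_score_* functions) is hypothetical and made up for illustration;
it is not taken from the proteomic code mentioned above.  It only shows
the general shape of the problem: a factory call in the innermost loop
builds a fresh object per iteration, while hoisting the call out of the
loops gives the same answer with a single construction.

```python
# Hypothetical illustration only; none of these names come from a real code.

class Scorer:
    """Stand-in for an object that is costly to construct."""
    instances = 0  # counts how many Scorer objects have been built

    def __init__(self, match=1, mismatch=-1):
        Scorer.instances += 1
        self.match = match
        self.mismatch = mismatch

    def score(self, a, b):
        return self.match if a == b else self.mismatch


def make_scorer():
    """Factory function, as in the object factory pattern."""
    return Scorer()


def total_score_factory_in_loop(seqs_a, seqs_b):
    """Anti-pattern: the factory is called once per residue pair."""
    total = 0
    for a in seqs_a:
        for b in seqs_b:
            for x, y in zip(a, b):
                scorer = make_scorer()  # new object on every iteration
                total += scorer.score(x, y)
    return total


def total_score_hoisted(seqs_a, seqs_b):
    """Same computation with the object built once and reused."""
    scorer = make_scorer()
    total = 0
    for a in seqs_a:
        for b in seqs_b:
            for x, y in zip(a, b):
                total += scorer.score(x, y)
    return total
```

With two sequences of length 4 on each side, the first version
constructs 16 objects and the second constructs one, for the same
result.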

With pretty much the sole exception of the HMMer code by Sean Eddy, it
was rare that people were focused on performance along with good
coding/design/implementation.  We did some pretty simple recoding of
elements of this, and part of our larger group did some MPI and GPU 
work.  Many of the concepts were folded into HMMer 3 (current version, 
should be the one people use).  But this was done as he needed the tool 
to be faster, and he paid attention to the issues associated with this.

This had not been true of (most of) the rest the last time I looked.

This is important as one of the things that would tremendously benefit
this community is a set of libraries of routines that can be reviewed,
collected, and used much as BLAS is.  Then, rather than writing your own
implementation of a particular algorithm, you leverage the basic working
plumbing of these libraries to build your code (similar to LAPACK, etc.)

Unfortunately, I see a rather significant "not invented here" viewpoint
in some groups, and it's not lacking here.  Which means, in many cases,
there are huge efforts expended to re-invent core algorithms.  Less
effort goes into building tools atop another set of tools.

As many have noted, the parallelism is expressed through a scheduler.  I 
remember calling this style of computing:  high throughput.  That is, 
more parallelism by wider distribution of computation, not maximization 
of single run performance across a larger machine.  There's nothing
intrinsically wrong with this; it's a different way to do things.  It
also breaks critical assumptions in many HPC areas.  I remember, having
written ctblastall in the 1999-2000 time frame that seamlessly distributed
blast computations across a cluster, that we started running into issues 
with job schedulers.  ctblastall would, in some of the larger cases, 
divide up the large blast run into tens of thousands of smaller jobs, 
and submit them to a cluster.  Job schedulers, back then, for the most 
part, couldn't handle it.  Platform's LSF could.  Got a bit sluggish, 
but it worked.  I included a number of optimizations to try to keep the 
throughput high, including early submission of bigger jobs, among other 
things like data motion optimization.
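
The divide-and-submit idea described above can be sketched roughly as
follows.  This is not the ctblastall implementation, just a minimal
Python illustration under stated assumptions: the queries are taken to
be in FASTA format, chunk size is measured in total residues, and the
function names (read_fasta, chunk_queries, submission_order) and the
max_residues parameter are all hypothetical.  The actual submission to a
scheduler is deliberately left out, since that depends on the site.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(parts)))
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)
    if header is not None:
        records.append((header, "".join(parts)))
    return records


def chunk_queries(records, max_residues):
    """Group consecutive records into chunks of at most max_residues.

    A single record longer than the limit gets a chunk of its own.
    """
    chunks, current, size = [], [], 0
    for rec in records:
        n = len(rec[1])
        if current and size + n > max_residues:
            chunks.append(current)
            current, size = [], 0
        current.append(rec)
        size += n
    if current:
        chunks.append(current)
    return chunks


def submission_order(chunks):
    """Order chunks largest first, so the longest jobs start early."""
    return sorted(
        chunks,
        key=lambda chunk: sum(len(seq) for _, seq in chunk),
        reverse=True,
    )
```

Each chunk would then become one scheduler job.  Submitting the biggest
chunks first is one simple way to keep a few long jobs from trailing at
the end of the run.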

Don't disparage this as not being high performance ... it is.  It's just
expressed/used differently.

As to the point of code being written in Python/R/...  Mebbe that's not
such a good idea (Python).  R, Matlab, Octave,... are interpreted as
well.  Langs compiled to a VM are ok (Java, Perl6, Julia), but best
performance is going to come from statically compiled code.  This said,
the Julia people are doing their absolute best to be on par with
statically compiled code, and it's coming very close.

But in my mind, it's still a huge missed opportunity to not be using
something akin to BLAS for bioinfo codes.  And somewhat worse, these 
groups often are ... er ... influenced by computer science fads ... and 
you see that in their code.  Some of these are hard to unwind into good 
code.  The object factory design pattern is a great example of this.  I 
am not sure why they do this, other than I see lots of academic 
collaboration between bioinfo folks and comp sci folks, and the comp sci 
folks need to publish papers, not write good code.

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615