[Beowulf] Maker2 genomic software license experience?

Lux, Jim (337C) james.p.lux at jpl.nasa.gov
Fri Nov 9 06:10:24 PST 2012

On 11/9/12 4:59 AM, "Joe Landman" <landman at scalableinformatics.com> wrote:

>On 11/09/2012 06:20 AM, Igor Kozin wrote:
>> You nailed it! And not just the code, new codes appear all the time.
>> bowtie, bwa, soap2, soap3, bowtie2, snap ..
>This was one of the major issues we found when we were pitching
>accelerators to VCs in the early 2000s.  There were, at the time,
>something on the order of 200 phylogenetic codes.  Many alignment codes.
>Many codes of any given type X.
>Seemed that every lab had/used something different, so an accelerator
>had to be able to work on a generic set of problems, without complex
>rewriting of code.
>Moreover, and this might be my biases showing, the code quality wasn't
>... erm ... high. 
>Unfortunately, I see a rather significant "not invented here" viewpoint
>in some groups, and it's not lacking here.  Which means, in many cases,
>huge efforts are expended on re-inventing core algorithms, and far less
>on building tools atop another set of tools.

Or, this is probably a manifestation of the people with the domain
expertise not coming from a software-rich background.

If you looked at, say, numerical codes for finite element analysis, the
people doing that have been using computers for decades, so there's a
goodly number of people who have gone through the learning curve of "roll
your own" vs "use the library". Or, even if they're not actually doing
it, they're working with other people who are doing it, so they pick up
"good design" by osmosis if nothing else.

Engineers and physicists and such have had to take programming classes
since the 70s (at least), bio majors did not.  My daughter is a mol bio
major on a premed track at JHU, and there's no requirement for a
programming class. There are some electives (among about 250) that are
things like "bioinformatics and genetics" but I'm going to bet there's no
slinging of code there. I tried to convince her that she should at *least*
take a class in Matlab or Python, but the reality is that premeds optimize
their course load for perceived attractiveness to med school admissions.
They're more focused on getting a research opportunity in a lab.  Sure,
you might wind up doing some programming in such a setting (at least I see
resumes for people who claim to have done so), but I get the impression
that "developed efficient protein folding code" doesn't carry as much
weight as "spent 6 months using a micropipette and running gels".

The biology world is not one conducive to inculcating good software
development practices.

The other thing is that a lot of codes (I don't know about the biology
space, but certainly in engineering) are rarely written from scratch.
Each starts as someone else's code that you modify, which someone else
then modifies, and so on.  Over time, I think that process also leads to
better design: the nasty codes tend to die out, the good ones persist
and are reused.

>As many have noted, the parallelism is expressed through a scheduler.  I
>remember calling this style of computing:  high throughput.  That is,
>more parallelism by wider distribution of computation, not maximization
>of single run performance across a larger machine.  There's nothing
>intrinsically wrong with this, it's a different way to do things.  It
>also breaks critical assumptions in many HPC areas.  I remember that,
>having written ctblastall in the 1999-2000 time frame to seamlessly
>distribute blast computations across a cluster, we started running into
>issues with job schedulers.  ctblastall would, in some of the larger cases,
>divide up the large blast run into tens of thousands of smaller jobs,
>and submit them to a cluster.  Job schedulers, back then, for the most
>part, couldn't handle it.  Platform's LSF could.  Got a bit sluggish,
>but it worked.  I included a number of optimizations to try to keep the
>throughput high, including early submission of bigger jobs, among other
>things like data motion optimization.
>Don't disparage this as not being high performance ... it is.  It's
>just expressed and used differently.
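For readers who haven't seen this style of computing, here's a minimal
sketch of the chunk-and-submit idea in Python.  This is illustrative
only, not ctblastall itself; the `bsub`/`blastall` command line is a
placeholder, and the chunk filenames are invented:

```python
# Sketch of high-throughput BLAST: split a multi-FASTA query file into
# small chunks, then hand each chunk to the batch scheduler as its own
# job.  Parallelism comes from the scheduler, not from a single big run.
import itertools

def fasta_records(lines):
    """Yield each FASTA record (header line + sequence lines) as one string."""
    record = []
    for line in lines:
        if line.startswith(">") and record:
            yield "".join(record)
            record = []
        record.append(line)
    if record:
        yield "".join(record)

def chunks(iterable, size):
    """Group an iterable into lists of at most `size` items."""
    it = iter(iterable)
    while True:
        block = list(itertools.islice(it, size))
        if not block:
            return
        yield block

def submit_commands(fasta_lines, chunk_size, db="nr"):
    """Build one scheduler submission per chunk of query sequences."""
    cmds = []
    for i, records in enumerate(chunks(fasta_records(fasta_lines), chunk_size)):
        qfile = "chunk_%05d.fa" % i
        # A real driver would write `records` out to qfile here.
        cmds.append("bsub blastall -p blastp -d %s -i %s" % (db, qfile))
    return cmds
```

With tens of thousands of chunks, the driver's job is mostly babysitting
the scheduler, which is exactly where the schedulers of that era fell over.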
>As to the point of code being written in Python/R/...  Mebbe that's not
>such a good idea (Python).  R, Matlab, Octave, ... are interpreted as
>well.  Compiled langs targeting a VM are ok (Java, Perl6, Julia), but
>the best performance is going to come from statically compiled code.
>That said, the Julia people are doing their absolute best to be on par
>with statically compiled code, and it's coming very close.
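The interpreted-vs-compiled gap is easy to see in miniature: keep the
inner loop in the interpreter and you pay per-iteration overhead; push
it into a compiled library kernel and you don't.  A hedged sketch using
NumPy (the function names are mine; NumPy's `@` operator dispatches this
computation to a BLAS GEMM underneath):

```python
# Same computation two ways: an all-pairs dot product of feature
# vectors, once with explicit interpreted loops, once as a single
# library call that runs in compiled code.
import numpy as np

def similarity_naive(X):
    """All-pairs dot products with explicit Python loops (interpreted)."""
    n, d = X.shape
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            for k in range(d):
                S[i, j] += X[i, k] * X[j, k]
    return S

def similarity_blas(X):
    """Same result as one matrix multiply, delegated to BLAS."""
    return X @ X.T
```

The results are identical; only where the loop runs differs, and that
difference is typically orders of magnitude on realistic sizes.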
>But in my mind, it's still a huge missed opportunity to not be using
>something akin to BLAS for bioinfo codes.  And somewhat worse, these
>groups often are ... er ... influenced by computer science fads ... and
>you see that in their code.  Some of these are hard to unwind into good
>code.  The object factory design pattern is a great example of this.  I
>am not sure why they do this, other than I see lots of academic
>collaboration between bioinfo folks and comp sci folks, and the comp sci
>folks need to publish papers, not write good code.
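For anyone who hasn't run into the pattern being criticized, here's a toy
sketch.  All the class and function names are invented, and the "scorer"
is a stand-in, not a real alignment algorithm; the point is only the
layering:

```python
def score(a, b):
    """Toy similarity: count matching positions (illustration only)."""
    return sum(x == y for x, y in zip(a, b))

# The direct call above is what a hot inner loop actually wants.  The
# factory machinery below adds a registry, a wrapper class, and two
# extra method calls just to reach the same computation.

class SmithWatermanAligner:
    """Thin wrapper class around the plain scoring function."""
    def align(self, a, b):
        return score(a, b)

class AlignerFactory:
    """Registry-plus-factory indirection over the aligner classes."""
    _registry = {}

    @classmethod
    def register(cls, name, maker):
        cls._registry[name] = maker

    @classmethod
    def create(cls, name):
        return cls._registry[name]()

AlignerFactory.register("sw", SmithWatermanAligner)
```

Both paths compute the same number; unwinding the factory layer to get
at the inner loop is the "hard to unwind" part.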
>Joseph Landman, Ph.D
>Founder and CEO
>Scalable Informatics Inc.
>email: landman at scalableinformatics.com
>web  : http://scalableinformatics.com
>        http://scalableinformatics.com/sicluster
>phone: +1 734 786 8423 x121
>fax  : +1 866 888 3112
>cell : +1 734 612 4615
>Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>To change your subscription (digest mode or unsubscribe) visit
