[Beowulf] RE: [Bioclusters] FPGA in bioinformatics clusters (again?)

Joe Landman landman at scalableinformatics.com
Mon Jan 16 18:26:49 PST 2006


Mike Davis wrote:
> Exactly Joe, these codes were written (often in Perl) to solve smaller 
> problems. They may never have envisioned having to assemble 50k+ 
> sequences. I would also point out that much of bioinformatics is 
> performed on flat text file databases.

nt is 14+GB last I looked.  Flat file.  Not so much a database as a data 
file.

> Those of us who also work with physicists and chemists know that their 
> software and algorithms are tweaked for very good, if not excellent, 
> performance. Much of that work was done (in Fortran) in the '70s and 
> '80s. But the field of bioinformatics is so new that no one has made 
> those kinds of optimizations (as far as I know, for many of the programs).

So there is something of significant interest here.  What makes 
physicists' and chemists' codes fast is in part that they can standardize 
their calculations in terms of smaller mathematical or algorithmic 
primitives, and then build upon those.  You don't need to write your own 
singular value decomposition or complex Hermitian eigenvalue solver; 
you can use the prepackaged ones (from Netlib, specifically 
BLAS/LAPACK and friends), with a well-accepted standard interface.  Each 
of these algorithms is the basis for lots of other calculations.
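
To make the point concrete, here is a minimal sketch in C against the 
standard CBLAS interface: a matrix multiply done by the vendor-tuned 
dgemm rather than by hand-rolled loops.  (Link flags vary by vendor, 
e.g. -lcblas -lblas; this is illustrative, not from any particular code.)

/* C = A * B via the standard BLAS dgemm, row-major 2x2 matrices. */
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    double A[4] = { 1.0, 2.0,
                    3.0, 4.0 };
    double B[4] = { 5.0, 6.0,
                    7.0, 8.0 };
    double C[4] = { 0.0, 0.0,
                    0.0, 0.0 };

    /* C = 1.0 * A * B + 0.0 * C */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,          /* M, N, K       */
                1.0, A, 2,        /* alpha, A, lda */
                B, 2,             /* B, ldb        */
                0.0, C, 2);       /* beta, C, ldc  */

    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);
    return 0;
}

Those six arguments buy you whatever tuning the BLAS vendor has put 
underneath, for free.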

There are other reasons as well, but the standardization of various 
components made life easier.  This way, even if everyone wants to write 
their own self-consistent field code, they can all build on a few well 
documented, well tested core algorithms that happen to be really fast.

This sort of standardization of lower-level calculations might benefit 
informatics as well, though it would also require some level of 
standardization of data types and representations.
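
Purely as a hypothetical sketch (none of these names exist in any real 
library), the sort of shared primitive I have in mind might look like 
this: agree on one sequence record and one scoring interface, and a 
tuned Smith-Waterman could sit underneath every tool the way BLAS sits 
underneath the chemistry codes.

#include <stddef.h>

/* Hypothetical standardized sequence record. */
typedef struct {
    char   *id;       /* sequence identifier, e.g. an accession */
    char   *residues; /* nucleotides or amino acids, NUL-terminated */
    size_t  length;   /* number of residues */
} seq_record;

/* Hypothetical standard alignment-scoring interface.  With an agreed
   signature like this, implementations could compete on speed
   underneath, exactly as BLAS vendors do. */
int seq_align_score(const seq_record *a, const seq_record *b,
                    const int score_matrix[32][32],
                    int gap_open, int gap_extend);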

> There's a lot of room for improvement. We looked at the Paracel 
> solutions, but just couldn't justify the cost. Instead, we use a 
> combination of machines. Embarrassingly parallel work like BLAST runs on 
> the same clusters that run G03, GAMESS and FEMAP. The assemblies run on 
> large SMP Suns with 32-96 GB of RAM and large disk filesystems.

So this raises the question: what would represent an improvement?  Our 
customers running most of these codes seem to be fairly happy, though 
g03 and GAMESS consume some serious resources.  It depends upon the 
calculation, of course.

Throwing more memory at some problems helps (large coupled-cluster 
codes, for example).  Where are the pain points, apart from the file 
system?  And if the file system is the big one, tell us more about 
which FS you are using.
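
(On Mike's 50,000-files quiz below: when you can't change filesystems, 
the usual workaround is to hash the files into a couple of hundred 
subdirectories so no single directory gets huge.  A rough, untested 
sketch in C; the function names are mine, not from any tool:)

/* Spread files across 256 subdirectories ("00" .. "ff") by hashing
   the filename.  Error handling kept minimal for brevity. */
#include <stdio.h>
#include <sys/stat.h>

static unsigned bucket(const char *name)
{
    unsigned h = 5381;                  /* djb2 string hash */
    while (*name)
        h = h * 33 + (unsigned char)*name++;
    return h % 256;
}

int shard_path(const char *name, char *out, size_t outlen)
{
    char dir[16];

    snprintf(dir, sizeof dir, "%02x", bucket(name));
    mkdir(dir, 0755);                   /* ignore failure: may exist */
    return snprintf(out, outlen, "%s/%s", dir, name);
}

int main(int argc, char **argv)
{
    char path[4096];
    int i;

    for (i = 1; i < argc; i++) {
        shard_path(argv[i], path, sizeof path);
        printf("%s -> %s\n", argv[i], path);
    }
    return 0;
}

The assemblers themselves have to know about the layout, of course, 
which is Craig's point below: the fix really has to go into the codes.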

Joe

> 
> Mike Davis
> 
> Joe Landman wrote:
> 
>> Hi Craig:
>>
>> Craig Tierney wrote:
>>
>>> Mike Davis wrote:
>>>
>>>> But BLAST is only a small part, and arguably the easiest part, of 
>>>> genomics work. The advantages of parallelization and/or SMP come 
>>>> into play when attempting to assemble the genome. Phred/Phrap can do 
>>>> the work but starts to slow even large machines when you're talking 
>>>> 50k+ sequences (which it wants to be in one folder). A quiz for the 
>>>> Unix geeks out there: what happens when a folder has 50,000 files 
>>>> in it? Can you say SLOOOOOOOOOWWWW?
>>>>
>>> First, pick the right filesystem.
>>> Second, rewrite your code so you don't have 50k+ files in one directory.
>>> There must be some straightforward way to solve the problem if
>>> you have too many files in one directory.
>>
>>
>> Lots of the informatics codes were not written with that kind of input 
>> (or database) scaling in mind.  For them, 10-100 files in a directory 
>> isn't much of a problem.  It's when you start to scale up that the 
>> bugs and surprises start.
>>
>>
>> Joe
>>


-- 
Joseph Landman, Ph.D.
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615