[Beowulf] RE: [Bioclusters] FPGA in bioinformatics clusters (again?)
Joe Landman landman at scalableinformatics.com
Mon Jan 16 18:26:49 PST 2006
Mike Davis wrote:

> Exactly Joe, these codes were written (often in Perl) to solve smaller
> problems. They may have never envisioned 50k+ sequences to be assembled.
> I would also point out that much of bioinformatics is performed on flat
> text file databases.

nt is 14+GB last I looked. Flat file. Not so much a database as a data file.

> Those of us who also work with physicists and chemists know that their
> software and algorithms are tweaked for very good if not excellent
> performance. Much of that work was done (in Fortran) in the 70's and
> 80's. But the field of bioinformatics is so new that no one has made
> those types of optimizations (as far as I know for many of the programs).

So there is something of significant interest here. What makes the
physicists' and chemists' codes fast is in part that they can standardize
their calculations in terms of smaller mathematical or algorithmic
primitives, and then build upon that. You don't need to write your own
singular value decomposition, or complex Hermitian eigenvalue equation
solver; you can use the prepackaged ones (from Netlib, specifically
BLAS/LAPACK and friends), with a well accepted standard interface. Each of
these algorithms is the basis for lots of other calculations.

There are other reasons as well, but the standardization of various
components made life easier. This way even if everyone wants to write
their own self-consistent field code, they can utilize several well
documented and tested core algorithms that happen to be really fast.

This sort of standardization of various lower level calculations might
benefit informatics as well, though this also requires some level of
standardization of data types and representation.

> There's a lot of room for improvement. We looked at the Paracel
> solutions, but just couldn't justify the cost. Instead, we make use of a
> combination of machines. Embarrassingly parallel work like BLAST runs on
> the same clusters that run G03, GAMESS and FEMAP. The assemblies run on
> large SMP Suns with 32-96GB of RAM, and large disk filesystems.

So this raises the question: what would represent an improvement? Our
customers running most of these codes seem to be fairly happy, though G03
and GAMESS consume some serious resources. Depends upon the calculation of
course. Throwing more memory at some problems helps (large coupled cluster
codes).

Where are the pain points? Apart from the file system (and if this is it,
tell us more about which FS you are using).

Joe

>
> Mike Davis
>
> Joe Landman wrote:
>
>> Hi Craig:
>>
>> Craig Tierney wrote:
>>
>>> Mike Davis wrote:
>>>
>>>> But BLAST is only a small part and arguably the easiest part of
>>>> genomics work. The advantages of parallelization and/or SMP come
>>>> into play when attempting to assemble the genome. Phred/Phrap can do
>>>> the work but starts to slow even large machines when you're talking
>>>> 50k+ of sequences (which it wants to be in one folder). A quiz for
>>>> the Unix geeks out there: what happens when a folder has 50,000
>>>> files in it? Can you say SLOOOOOOOOOWWWW?
>>>>
>>> First, pick the right filesystem.
>>> Second, rewrite your code so you don't have 50k+ files in one directory.
>>> There must be some straightforward way to solve the problem if
>>> you have too many files in one directory.
>>
>>
>> Lots of the informatics codes were not written with such input (or
>> database) scaling in mind. For them, 10-100 files in a directory
>> isn't much of a problem. It's when you start to scale up that the bugs
>> and surprises start.
>>
>>
>> Joe
>>

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615
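
Editor's note: Craig's second suggestion in the quoted thread (don't keep
50k+ files in one directory) can be sketched roughly as below. This is only
an illustration, not anything posted to the list. The names shard_dir and
shard_files are hypothetical, as is the choice of a SHA-1 prefix as the
bucket key. With the default of one level of two hex characters, 50,000
files spread over at most 256 subdirectories, so roughly 200 files each. It
also assumes the downstream code can be taught to look files up through
shard_dir, which the thread suggests is not true of Phred/Phrap as shipped.

```python
import hashlib
import os
import shutil


def shard_dir(name: str, levels: int = 1, width: int = 2) -> str:
    """Return a relative subdirectory derived from a hash of the file name.

    Hypothetical helper: the same name always maps to the same bucket,
    so a reader can recompute where a file lives without scanning.
    """
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(*parts)


def shard_files(src: str, dst: str, levels: int = 1, width: int = 2) -> int:
    """Move every regular file in src into hashed subdirectories of dst.

    Returns the number of files moved. Assumes dst is not inside src.
    """
    # Snapshot the names first so the directory is not modified mid-scan.
    with os.scandir(src) as entries:
        names = [entry.name for entry in entries if entry.is_file()]

    for name in names:
        target = os.path.join(dst, shard_dir(name, levels, width))
        os.makedirs(target, exist_ok=True)
        shutil.move(os.path.join(src, name), os.path.join(target, name))
    return len(names)
```

A lookup is then os.path.join(dst, shard_dir(name), name) rather than a
scan of one huge directory. Whether this helps at all depends on the
filesystem in use, which is the first point Craig raises.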
