[Beowulf] RE: [Bioclusters] FPGAin bioinformatics clusters (again?)

Robert G. Brown rgb at phy.duke.edu
Tue Jan 17 07:43:04 PST 2006

On Mon, 16 Jan 2006, Mike Davis wrote:

> But BLAST is only a small part and argueably the easiest part of genomics 
> work. The advantages of parallelization and/or smp come into play when 
> attempting to assemble the genome. Phred/Phrap can do the work but starts to 
> slow even large machines when your talking 50k+ of sequences (which it wants 
> to be in one folder). A quiz for  the Unix geeks out there, what happens when 
> a folder has 50,000 files in it. Can you say SLOOOOOOOOOWWWW?

Right, but that's why God invented databases and tree structures and the
like.  The problem is in the software design.  One file with 50K
sequences in it will read in very quickly and only has to stat once.
50K sequences in a file apiece will have to stat EACH file, period, and
stat is expensive, especially if you're using e.g. NIS or the like
naively that adds a network hit per new file reference to check up on
all the perms and groups and so on.  It also takes time just to build
the kernel's image of inodes if there are 50K of them in a single
directory -- it DOES have to be read a file with 50K entries just to
start the process of stat'ing them one at a time.  I suspect that this
exceeds the capacity of the kernel to smoothly cache the file data as

A compromise is to organize the files into a tree (if the problem is a
search problem with some sort of ordinal or catagorical data
organization) so that there aren't so many inodes to parse at any given
level.  Whether or not that wins for you probably depends on the task's
organization -- if you can avoid some parts (ideally most!) of the tree
altogether it should be a big win.  Looking things up in a tree or a
hash is much, much more efficient than looking them up in a linear list.

Of course if the task is just reading in each file, one at a time, and
searching all sorts of things WITHIN the file, then tanstaffl -- if the
task is dominated by the actual time spent IN the file doing the
lookups, then you don't have a lot of room to speed it up no matter how
you organize it, although still one big sequential file will be somewhat
faster than many smaller sequential files.  If you were really
interested in minimizing time, you might be able to do a simple forking
off of threads to manage the lookup/stat/open of the next file "in
parallel with" the processing of the current file.

As in:

   Stat/Open file A (costs perhaps ~1 ms latency)
   Read file A into memory (costs filesize/disk bw)
   Fork Stat/Open file B   AND   Process file A
   (issue int/block/return)      (work during block)

The idea here is that the stat/open thread is likely to issue an
interrupt to the disk and then block during the seek, which is
responsible for most of the time (although just reading the inode table
the first time will be a hit as well, we can hope that thereafter it
will be cached).  During the block process control is likely to be
returned to the work thread so that it can proceed in parallel with the
slow old disk.  This might help "hide" most of the latency and shift the
ratio of user to system time a bit further towards user, or might not --
never really tried it.

Which won't help at all if you don't have the source code, but hey,
that's what open vs closed source code is all about.  If you have the
source, you can fix boneheaded design decisions and optimize and improve
-- probably beyond even what I suggest here as I'm not a real computer
scientist and a real one could probably think of even better ways to
proceed.  If you don't have the source, and your only interface to those
that do is some sales guy who thinks inodes are a kind of show-off
intellectual, interrupts are rude at dinner, and blocks are things his
kid plays with at home, well...


> Mike Davis
> Lukasz Salwinski wrote:
>> Michael Will wrote:
>>> I have always been amazed at the promises of massivelyparallel. Now
>>> their
>>> technique is so good they don't even need the source code to
>>> parallelize?
>>> ...but if I tell you how I would have to kill you...
>>> Michael Will 
>> uh.. just a quick comment on bioinformatics and parallelizing things...
>> please note, that most of the bioinformatic problems are already
>> embarrassingly parallel and, with the new genomes showing up at an amazing 
>> rate, getting more and more so. Thus, in most cases, it just
>> doesn't make much sense to parallelize anything - if one's got to
>> run 300x4000 blasts against a library of 300x4000 sequences (ie 300
>> genomes, 4000 genes/proteins, all vs all) the simplest solution -
>> a lot of nodes, blast optimized for a single cpu and a decent queing
>> system will ultimately win (as long as one stays within the same
>> architecture; FPGAs are a diferent story ;o)
>> lukasz
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit 
> http://www.beowulf.org/mailman/listinfo/beowulf

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu

More information about the Beowulf mailing list