[Beowulf] RE: [Bioclusters] FPGA in bioinformatics clusters (again?)

Roderick Sprattling roderick at solentas.com
Tue Jan 17 15:42:21 PST 2006


If the sequence database is read-only, one can set up an indexed 
database by storing all sequences in a single file that begins with a 
header of index data (key, byte offset, length). Then it's a matter of 
opening the file once, transforming the index data into an appropriate 
data structure, and using fseek/fread within that single file to access 
the data.
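
A minimal sketch of this layout in Python, where seek/read stand in for 
fseek/fread. The on-disk format here (a record count, then one 
key/offset/length entry per sequence, then the raw sequence bytes) and 
the function names write_indexed, load_index and fetch are assumptions 
made for illustration, not an existing tool's format.

```python
import struct

def write_indexed(path, records):
    """Write {key: bytes} as one file: index header first, then the data."""
    index = b""
    offset = 0  # offsets are relative to the start of the data section
    for key, seq in records.items():
        k = key.encode()
        index += struct.pack("<H", len(k)) + k
        index += struct.pack("<QQ", offset, len(seq))
        offset += len(seq)
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(records)))  # number of index entries
        f.write(index)
        for seq in records.values():
            f.write(seq)

def load_index(f):
    """Read the header once into a dict: key -> (absolute offset, length)."""
    (count,) = struct.unpack("<I", f.read(4))
    entries = []
    for _ in range(count):
        (klen,) = struct.unpack("<H", f.read(2))
        key = f.read(klen).decode()
        rel, length = struct.unpack("<QQ", f.read(16))
        entries.append((key, rel, length))
    data_start = f.tell()  # the data section begins right after the header
    return {key: (data_start + rel, length) for key, rel, length in entries}

def fetch(f, index, key):
    """Seek to one sequence and read exactly its bytes."""
    offset, length = index[key]
    f.seek(offset)
    return f.read(length)
```

With this arrangement the file is opened (and stat'ed) once, and each 
lookup costs one dictionary access plus one seek and read.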

For really large datasets it may pay to presort the index as that speeds 
creation of key-ordered data structures when loaded.
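
One way to picture the benefit, as a sketch only: if the index entries 
are stored in key order, they can be loaded straight into parallel lists 
and binary-searched with Python's bisect module, with no sorting or tree 
building at load time. The names build_sorted_index and SortedIndex are 
hypothetical.

```python
from bisect import bisect_left

def build_sorted_index(entries):
    """Done once, when the database is built: sort (key, offset, length)."""
    return sorted(entries)

class SortedIndex:
    def __init__(self, presorted_entries):
        # Assumes the entries arrive already in key order, so loading is a
        # single linear pass with no sort and no tree insertions.
        self.keys = [e[0] for e in presorted_entries]
        self.locs = [(e[1], e[2]) for e in presorted_entries]

    def lookup(self, key):
        """Binary search for one key; returns (offset, length)."""
        i = bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.locs[i]
        raise KeyError(key)

    def key_range(self, lo, hi):
        """All (key, (offset, length)) with lo <= key < hi, in key order."""
        i, j = bisect_left(self.keys, lo), bisect_left(self.keys, hi)
        return list(zip(self.keys[i:j], self.locs[i:j]))
```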

Rod


Mike Davis wrote:
> Robert,
> 
> I agree. We want to avoid having the OS create the tree structure 
> necessary to deal with those file sizes, and avoid stat'ing all of those 
> files. A user-created tree structure would save time, and an indexed 
> database possibly would as well.
> 
> Basically, each and every sequence needs to be compared and recompared 
> as the assembly runs. In the end, the goal is an assembled genome with 
> no gaps.
> 
> Better file structures and/or indexing could both be a big help.
> 
> For everyone, these analyses run on a dedicated local filesystem on an 
> SMP Sun with either 32 or 96GB of RAM. The actual genome itself is only 
> a GB at most when assembled.
> 
> 
> Mike Davis
> 
> 
> 
> Robert G. Brown wrote:
> 
>> On Mon, 16 Jan 2006, Mike Davis wrote:
>>
>>> But BLAST is only a small part, and arguably the easiest part, of 
>>> genomics work. The advantages of parallelization and/or SMP come into 
>>> play when attempting to assemble the genome. Phred/Phrap can do the 
>>> work but starts to slow even large machines when you're talking 50k+ 
>>> sequences (which it wants to be in one folder). A quiz for the Unix 
>>> geeks out there: what happens when a folder has 50,000 files in it? 
>>> Can you say SLOOOOOOOOOWWWW?
>>
>>
>>
>> Right, but that's why God invented databases and tree structures and the
>> like.  The problem is in the software design.  One file with 50K
>> sequences in it will read in very quickly and only has to stat once.
>> 50K sequences in a file apiece will have to stat EACH file, period, and
>> stat is expensive, especially if you're naively using e.g. NIS or the
>> like, which adds a network hit per new file reference to check up on
>> all the perms and groups and so on.  It also takes time just to build
>> the kernel's image of inodes if there are 50K of them in a single
>> directory -- it DOES have to read a file with 50K entries just to
>> start the process of stat'ing them one at a time.  I suspect that this
>> exceeds the capacity of the kernel to smoothly cache the file data as
>> well.
>>
>> A compromise is to organize the files into a tree (if the problem is a
>> search problem with some sort of ordinal or categorical data
>> organization) so that there aren't so many inodes to parse at any given
>> level.  Whether or not that wins for you probably depends on the task's
>> organization -- if you can avoid some parts (ideally most!) of the tree
>> altogether it should be a big win.  Looking things up in a tree or a
>> hash is much, much more efficient than looking them up in a linear list.
>>
>> Of course if the task is just reading in each file, one at a time, and
>> searching all sorts of things WITHIN the file, then TANSTAAFL -- if the
>> task is dominated by the actual time spent IN the file doing the
>> lookups, then you don't have a lot of room to speed it up no matter how
>> you organize it, although still one big sequential file will be somewhat
>> faster than many smaller sequential files.  If you were really
>> interested in minimizing time, you might be able to do a simple forking
>> off of threads to manage the lookup/stat/open of the next file "in
>> parallel with" the processing of the current file.
>>
>> As in:
>>
>>   Stat/Open file A (costs perhaps ~1 ms latency)
>>   Read file A into memory (costs filesize/disk bw)
>>   Fork Stat/Open file B   AND   Process file A
>>   (issue int/block/return)      (work during block)
>>
>> The idea here is that the stat/open thread is likely to issue an
>> interrupt to the disk and then block during the seek, which is
>> responsible for most of the time (although just reading the inode table
>> the first time will be a hit as well, we can hope that thereafter it
>> will be cached).  During the block, process control is likely to be
>> returned to the work thread so that it can proceed in parallel with the
>> slow old disk.  This might help "hide" most of the latency and shift the
>> ratio of user to system time a bit further towards user, or might not --
>> never really tried it.
>>
>> Which won't help at all if you don't have the source code, but hey,
>> that's what open vs closed source code is all about.  If you have the
>> source, you can fix boneheaded design decisions and optimize and improve
>> -- probably beyond even what I suggest here as I'm not a real computer
>> scientist and a real one could probably think of even better ways to
>> proceed.  If you don't have the source, and your only interface to those
>> that do is some sales guy who thinks inodes are a kind of show-off
>> intellectual, interrupts are rude at dinner, and blocks are things his
>> kid plays with at home, well...
>>
>>    rgb
>>
>>>
>>> Mike Davis
>>>
>>>
>>>
>>>
>>>
>>> Lukasz Salwinski wrote:
>>>
>>>> Michael Will wrote:
>>>>
>>>>> I have always been amazed at the promises of massivelyparallel. Now their
>>>>> technique is so good they don't even need the source code to
>>>>> parallelize?
>>>>>
>>>>> ...but if I tell you how, I would have to kill you...
>>>>>
>>>>> Michael Will 
>>>>
>>>>
>>>>
>>>>
>>>> uh.. just a quick comment on bioinformatics and parallelizing things...
>>>>
>>>> please note, that most of the bioinformatic problems are already
>>>> embarrassingly parallel and, with the new genomes showing up at an 
>>>> amazing rate, getting more and more so. Thus, in most cases, it just
>>>> doesn't make much sense to parallelize anything - if one's got to
>>>> run 300x4000 blasts against a library of 300x4000 sequences (ie 300
>>>> genomes, 4000 genes/proteins, all vs all) the simplest solution -
>>>> a lot of nodes, blast optimized for a single cpu and a decent queuing
>>>> system will ultimately win (as long as one stays within the same
>>>> architecture; FPGAs are a different story ;o)
>>>>
>>>> lukasz
>>>>
>>>
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org
>>> To change your subscription (digest mode or unsubscribe) visit 
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>>
>>
> 


