[Beowulf] RE: [Bioclusters] FPGA in bioinformatics clusters (again?)
Roderick Sprattling
roderick at solentas.com
Tue Jan 17 15:42:21 PST 2006
If the sequence database is read-only, one can set up an indexed
database by storing all sequences in a single file prefixed with index
data (key, byte offset, length). Then it's a matter of opening that
file once, transforming the index data into an appropriate in-memory
structure, and using fseek/fread within the single file to access the
data.
For really large datasets it may pay to presort the index, as that
speeds creation of key-ordered data structures at load time.
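
A minimal C sketch of that layout (the file name, fixed-size index
records, and 32-byte keys are illustrative assumptions, not a spec):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One fixed-size index record per sequence -- the layout is an
   assumption for the sketch; any serialized (key, offset, length)
   encoding works the same way. */
struct idx_entry {
    char key[32];   /* sequence identifier, NUL-padded */
    long offset;    /* byte offset of the sequence data */
    long length;    /* length of the sequence in bytes  */
};

static int cmp_key(const void *k, const void *e)
{
    return strncmp(k, ((const struct idx_entry *)e)->key, 32);
}

int main(void)
{
    FILE *db = fopen("seqs.db", "rb");   /* hypothetical file name */
    long n;
    struct idx_entry *idx, *e;

    if (!db || fread(&n, sizeof n, 1, db) != 1)
        return 1;
    idx = malloc(n * sizeof *idx);       /* load the whole index once */
    if (fread(idx, sizeof *idx, n, db) != (size_t)n)
        return 1;

    /* Presorted index => binary search instead of a linear scan. */
    e = bsearch("AB012345", idx, n, sizeof *idx, cmp_key);
    if (e) {
        char *seq = malloc(e->length + 1);
        fseek(db, e->offset, SEEK_SET);  /* jump straight to the data */
        fread(seq, 1, e->length, db);
        seq[e->length] = '\0';
        printf("%s\n", seq);
        free(seq);
    }
    free(idx);
    fclose(db);
    return 0;
}

With the index resident in RAM, each lookup costs one seek into the
data file instead of a directory scan plus a per-file stat/open.
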
Rod
Mike Davis wrote:
> Robert,
>
> I agree. We want to avoid having the OS create the tree structure
> necessary to deal with that many files, and to avoid stat'ing all of
> those files. A user-created tree structure would save time; an indexed
> database possibly would as well.
>
> Basically, each and every sequence needs to be compared and recompared
> as the assembly runs. In the end, the goal is an assembled genome with
> no gaps.
>
> Better file structures and/or indexing could both be a big help.
>
> For everyone, these analyses run on a dedicated local filesystem on an
> SMP Sun with either 32 or 96GB of RAM. The actual genome itself is only
> a GB at most when assembled.
>
>
> Mike Davis
>
>
>
> Robert G. Brown wrote:
>
>> On Mon, 16 Jan 2006, Mike Davis wrote:
>>
>>> But BLAST is only a small part, and arguably the easiest part, of
>>> genomics work. The advantages of parallelization and/or SMP come into
>>> play when attempting to assemble the genome. Phred/Phrap can do the
>>> work but starts to slow even large machines when you're talking 50k+
>>> sequences (which it wants to be in one folder). A quiz for the Unix
>>> geeks out there: what happens when a folder has 50,000 files in it?
>>> Can you say SLOOOOOOOOOWWWW?
>>
>>
>>
>> Right, but that's why God invented databases and tree structures and the
>> like. The problem is in the software design. One file with 50K
>> sequences in it will read in very quickly and only has to be stat'ed
>> once. 50K sequences stored one per file means stat'ing EACH file,
>> period, and stat is expensive, especially if you're naively using e.g.
>> NIS or the like, which adds a network hit per new file reference to
>> check up on all the perms and groups and so on. It also takes time
>> just to build the kernel's image of inodes if there are 50K of them in
>> a single directory -- it DOES have to read a directory with 50K
>> entries just to start the process of stat'ing them one at a time. I
>> suspect that this exceeds the capacity of the kernel to smoothly cache
>> the file data as well.
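>>
>> To see where the time goes, here's a rough C sketch (directory name
>> made up, error handling minimal) of the loop the many-files layout
>> forces on you -- one stat() per directory entry before a single byte
>> of sequence data gets read:
>>
>> #include <stdio.h>
>> #include <dirent.h>
>> #include <sys/stat.h>
>>
>> int main(void)
>> {
>>     DIR *d = opendir("seqs");           /* hypothetical directory */
>>     struct dirent *de;
>>     struct stat sb;
>>     char path[4096];
>>     long nstat = 0, nfiles = 0, bytes = 0;
>>
>>     if (!d) return 1;
>>     while ((de = readdir(d)) != NULL) {
>>         snprintf(path, sizeof path, "seqs/%s", de->d_name);
>>         nstat++;                        /* one syscall (and possibly */
>>         if (stat(path, &sb) != 0)       /* one disk seek or network  */
>>             continue;                   /* hit) per entry            */
>>         if (S_ISREG(sb.st_mode)) {
>>             nfiles++;
>>             bytes += sb.st_size;
>>         }
>>     }
>>     closedir(d);
>>     printf("%ld stats, %ld files, %ld bytes\n", nstat, nfiles, bytes);
>>     return 0;
>> }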
>>
>> A compromise is to organize the files into a tree (if the problem is a
>> search problem with some sort of ordinal or categorical data
>> organization) so that there aren't so many inodes to parse at any given
>> level. Whether or not that wins for you probably depends on the task's
>> organization -- if you can avoid some parts (ideally most!) of the tree
>> altogether it should be a big win. Looking things up in a tree or a
>> hash is much, much more efficient than looking them up in a linear list.
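>>
>> For instance, a sketch in C (fanout and naming made up) of hashing on
>> the key's leading characters so that no single directory ever holds
>> more than a few hundred entries:
>>
>> #include <stdio.h>
>>
>> /* "AB012345" -> "seqs/A/B/AB012345"; assumes keys are at least two
>>    characters and reasonably uniform in their leading characters. */
>> void seq_path(const char *key, char *buf, size_t len)
>> {
>>     snprintf(buf, len, "seqs/%c/%c/%s", key[0], key[1], key);
>> }
>>
>> int main(void)
>> {
>>     char path[256];
>>     seq_path("AB012345", path, sizeof path);
>>     printf("%s\n", path);    /* seqs/A/B/AB012345 */
>>     return 0;
>> }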
>>
>> Of course if the task is just reading in each file, one at a time, and
>> searching all sorts of things WITHIN the file, then TANSTAAFL -- if the
>> task is dominated by the actual time spent IN the file doing the
>> lookups, then you don't have a lot of room to speed it up no matter how
>> you organize it, although still one big sequential file will be somewhat
>> faster than many smaller sequential files. If you were really
>> interested in minimizing time, you might be able to do a simple forking
>> off of threads to manage the lookup/stat/open of the next file "in
>> parallel with" the processing of the current file.
>>
>> As in:
>>
>>   Stat/Open file A           (costs perhaps ~1 ms latency)
>>   Read file A into memory    (costs filesize/disk bw)
>>   Fork: Stat/Open file B     AND  Process file A
>>         (issue int/block/return)  (work during block)
>>
>> The idea here is that the stat/open thread is likely to issue an
>> interrupt to the disk and then block during the seek, which is
>> responsible for most of the time (although just reading the inode table
>> the first time will be a hit as well, we can hope that thereafter it
>> will be cached). During the block, process control is likely to be
>> returned to the work thread so that it can proceed in parallel with the
>> slow old disk. This might help "hide" most of the latency and shift the
>> ratio of user to system time a bit further towards user, or might not --
>> never really tried it.
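>>
>> In code, the idea might look something like the following -- a rough
>> pthreads sketch (threads rather than a literal fork; the file list and
>> process() are placeholders) that reads file N+1 while working on
>> file N:
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <pthread.h>
>>
>> struct slurp { const char *name; char *data; long len; };
>>
>> /* Runs in the reader thread: the stat/open/seek we're trying to
>>    hide behind the processing of the previous file. */
>> static void *read_file(void *arg)
>> {
>>     struct slurp *s = arg;
>>     FILE *f = fopen(s->name, "rb");
>>     if (!f) { s->data = NULL; return NULL; }
>>     fseek(f, 0, SEEK_END);
>>     s->len = ftell(f);
>>     rewind(f);
>>     s->data = malloc(s->len);
>>     fread(s->data, 1, s->len, f);
>>     fclose(f);
>>     return NULL;
>> }
>>
>> static void process(struct slurp *s)   /* stand-in for real work */
>> {
>>     printf("processing %s (%ld bytes)\n", s->name, s->len);
>> }
>>
>> int main(void)
>> {
>>     const char *files[] = { "a.seq", "b.seq", "c.seq" }; /* made up */
>>     int i, n = 3;
>>     struct slurp cur = { files[0], NULL, 0 }, next;
>>     pthread_t tid;
>>
>>     read_file(&cur);                    /* prime: read file 0 */
>>     for (i = 0; i < n; i++) {
>>         int have_next = i + 1 < n;
>>         if (have_next) {                /* start I/O on file i+1 ... */
>>             next.name = files[i + 1];
>>             next.data = NULL;
>>             next.len = 0;
>>             pthread_create(&tid, NULL, read_file, &next);
>>         }
>>         if (cur.data) {                 /* ... while working on file i */
>>             process(&cur);
>>             free(cur.data);
>>         }
>>         if (have_next) {
>>             pthread_join(tid, NULL);
>>             cur = next;
>>         }
>>     }
>>     return 0;
>> }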
>>
>> Which won't help at all if you don't have the source code, but hey,
>> that's what open vs closed source code is all about. If you have the
>> source, you can fix boneheaded design decisions and optimize and improve
>> -- probably beyond even what I suggest here as I'm not a real computer
>> scientist and a real one could probably think of even better ways to
>> proceed. If you don't have the source, and your only interface to those
>> that do is some sales guy who thinks inodes are a kind of show-off
>> intellectual, interrupts are rude at dinner, and blocks are things his
>> kid plays with at home, well...
>>
>> rgb
>>
>>>
>>> Mike Davis
>>>
>>>
>>>
>>>
>>>
>>> Lukasz Salwinski wrote:
>>>
>>>> Michael Will wrote:
>>>>
>>>>> I have always been amazed at the promises of massively parallel.
>>>>> Now their technique is so good they don't even need the source code
>>>>> to parallelize?
>>>>>
>>>>> ...but if I tell you how, I would have to kill you...
>>>>>
>>>>> Michael Will
>>>>
>>>>
>>>>
>>>>
>>>> uh.. just a quick comment on bioinformatics and parallelizing things...
>>>>
>>>> please note that most bioinformatics problems are already
>>>> embarrassingly parallel and, with new genomes showing up at an
>>>> amazing rate, getting more and more so. Thus, in most cases, it just
>>>> doesn't make much sense to parallelize anything - if one's got to
>>>> run 300x4000 blasts against a library of 300x4000 sequences (i.e. 300
>>>> genomes, 4000 genes/proteins, all vs all), the simplest solution -
>>>> a lot of nodes, blast optimized for a single cpu, and a decent queuing
>>>> system - will ultimately win (as long as one stays within the same
>>>> architecture; FPGAs are a different story ;o)
>>>>
>>>> lukasz
>>>>
>>>
>