[Beowulf] Pretty Big Data

mathog mathog at caltech.edu
Mon Jan 25 09:55:46 PST 2016


On 23-Jan-2016 12:00, Lux, Jim (337C) wrote:

> Dumb sequential search in ram is probably faster than a fancy indexing
> scheme on disk.
> And if your data set fits in RAM, why not.

True, but there can still be a big payoff from indexing the file in 
RAM as well.  I frequently work with text files of tens of millions of 
lines that are retrieved by line number.  These result from text or 
numeric keys that have been sorted into alphabetical or numeric order 
and are thereafter accessed by their position in that list: key 
(nonconsecutive values) -> key number (consecutive integer values, 
1->N).  The files have variable-length records, so one cannot do 
something like

    dd if=file.txt bs=80 skip=123456 count=1

to get record 123456 quickly. A dumb sequential search like

    head -123456 file.txt | tail -1

works.  Once all of file.txt is cached in memory, repeated lookups are 
much faster than the first one, which has to read from disk.  They are 
still not nearly as fast, however, as the little "indexed_text" tool I 
wrote, which does:

    indexed_text -index -in file.txt   # run once to build the index
    echo 123456 | indexed_text -in file.txt

This is the weakest of all possible indexing schemes: it indexes only 
line positions, not line contents, but it is very helpful here.  This 
particular implementation uses only memory mapping, so it will fail 
miserably if the file is too large to fit in memory.  (I only ever use 
it on machines with tons of memory.)  This capability isn't something 
most people will ever need, but if anybody ever does and stumbles 
across this thread, indexed_text is part of this:

   http://sourceforge.net/projects/drmtools/
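
For anyone who wants the flavor of the trick without the tool, here is 
a minimal sketch of the same idea in plain shell.  To be clear, this is 
an illustrative guess at the approach, not how indexed_text is actually 
implemented: build a fixed-width table of byte offsets once, then 
answer each query with two small seeks instead of a scan.

    # Build the index: the starting byte offset of every line, one
    # fixed-width 12-digit entry (13 bytes with the newline) per
    # record.  LC_ALL=C makes awk count bytes, not characters.
    LC_ALL=C awk '{printf "%012d\n", o; o += length($0) + 1}' \
        file.txt > file.idx

    # Fetch record 123456: one seek into the index for its offset,
    # then one jump into the data file.  The 10# prefix keeps bash
    # from reading the zero-padded offset as octal.
    off=$(dd if=file.idx bs=13 skip=$((123456 - 1)) count=1 2>/dev/null)
    tail -c +$((10#$off + 1)) file.txt | head -1

Because every index entry is exactly 13 bytes, dd can compute where 
entry 123456 lives without scanning, and on a regular file GNU tail 
will typically seek straight to the data offset, so the lookup cost no 
longer grows with the line number.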

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

