[Beowulf] Pretty Big Data
mathog
mathog at caltech.edu
Mon Jan 25 09:55:46 PST 2016
On 23-Jan-2016 12:00, Lux, Jim (337C) wrote:
> Dumb sequential search in ram is probably faster than a fancy indexing
> scheme on disk.
> And if your data set fits in RAM, why not.
True, but there can still be a big payoff from indexing the file in
RAM too. I frequently work with text files of tens of millions of
lines that are retrieved by line number. These result from text or
numeric keys that have been sorted into alphabetical or numeric order
and are thereafter accessed by their position in that list. So key
(non-consecutive values) -> key number (consecutive integer values,
1->N). The files have variable-length records, so one cannot do
something like
dd if=file.txt bs=80 skip=123456 count=1
to get record 123456 quickly. Dumb sequential searches like
head -123456 file.txt | tail -1
work. Once all of file.txt is cached in memory they run much faster
than the first pass, which has to read from disk. They are still not
nearly as fast as the little "indexed_text" tool I wrote, which does:
indexed_text -index -in file.txt   # run once to build the index
echo 123456 | indexed_text -in file.txt
This is the weakest of all possible indexing schemes: it doesn't index
by line contents, only by line positions, but it is very helpful here.
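For anybody who just wants the idea without downloading the tool, here
is a minimal sketch of the same line-position trick in plain shell
(bash). It assumes single-byte characters and Unix line endings, and
file.idx is a made-up name, not what indexed_text actually writes:

# build a fixed-width index: one 12-digit byte offset per input line
awk '{printf "%012d\n", o; o += length($0) + 1}' file.txt > file.idx
# fetch record 123456: its offset and the next one bound the record
# (each index entry is 13 bytes: 12 digits plus a newline)
off=$(dd if=file.idx bs=13 skip=123455 count=1 2>/dev/null)
nxt=$(dd if=file.idx bs=13 skip=123456 count=1 2>/dev/null)
# 10# forces base 10, since zero-padded numbers look octal to bash
dd if=file.txt bs=1 skip=$((10#$off)) count=$((10#$nxt - 10#$off)) 2>/dev/null

(For the last record there is no next entry; use the file size
instead of nxt.) indexed_text does the equivalent lookup against
memory-mapped data.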
indexed_text itself uses only memory mapping, so it will fail
miserably if the file is too large to fit into memory. (I only ever
use it on machines with tons of memory.) This capability isn't
something most people will ever need, but if anybody ever does and
stumbles across this thread, indexed_text is part of this:
http://sourceforge.net/projects/drmtools/
Regards,
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech