[Beowulf] Pretty Big Data
Lux, Jim (337C)
james.p.lux at jpl.nasa.gov
Mon Jan 25 07:06:44 PST 2016
Yes, there is a crossover point, which can be pushed around by one
specialized optimization or another (caching, read-ahead, etc.).
Figure that RAM has a read cycle time of about 1 ns (DDR3 or DDR4).
High speed disk drives (in the spinning-rust sense) at "commodity" kinds
of speeds seem to be in the 7200 RPM range, which works out to about a 4
millisecond average rotational latency.
So that's about a 4,000,000:1 ratio.
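A quick back-of-envelope check of that ratio, as a Python sketch (the ~1
ns DRAM cycle and 7200 RPM figures are the ones above; half a rotation is
the usual average-latency assumption):

    # back-of-envelope: DRAM access vs. average rotational latency
    ram_ns = 1.0                             # DRAM read cycle, ns
    rpm = 7200
    ms_per_rev = 60_000.0 / rpm              # ~8.33 ms per revolution
    avg_latency_ns = (ms_per_rev / 2) * 1e6  # half a rev, in ns
    print(avg_latency_ns / ram_ns)           # ~4.2 million to 1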
You could also look at it in raw bits across the wire(s): memory has a
parallel interface, while most disks have a serial interface (the data has
to be read off the platter serially, if nothing else).
So, if we're comparing searching a TB of data in RAM vs searching a TB of
data on a disk, I think the RAM is always going to win if it's a
sequential search.
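To put rough numbers on the sequential case (the ~20 GB/s streaming DRAM
bandwidth and ~150 MB/s single-drive figures below are my own assumptions,
not measurements):

    # rough sequential-scan times for 1 TB under assumed bandwidths
    tb = 1e12              # bytes to scan
    ram_bw = 20e9          # assumed streaming DRAM bandwidth, bytes/s
    disk_bw = 150e6        # assumed spinning-disk bandwidth, bytes/s
    print(tb / ram_bw)     # ~50 seconds from RAM
    print(tb / disk_bw)    # ~6700 seconds (~1.9 hours) from disk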
The more important question (and the one brought up by the original post)
was about the other costs: implementing and testing something that makes
the disk more efficient, be it indices, careful operation ordering to
avoid rotational latency, or fancier hardware that does this for you.
Smart data structures require more work and testing than simple, stupid
data structures. For a lot of applications, fast (to implement), simple,
and maybe O(N) might be a better approach than slower, complex, and
O(log N).
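As a toy illustration of that trade-off (hypothetical data; Python's
bisect stands in for the "smart" O(log N) structure):

    import bisect

    data = list(range(0, 10_000_000, 3))  # hypothetical sorted dataset

    # simple, stupid, O(N): no preprocessing, trivially correct
    def linear_find(xs, target):
        for x in xs:
            if x == target:
                return True
        return False

    # smarter, O(log N): only correct if the data is kept sorted
    def binary_find(sorted_xs, target):
        i = bisect.bisect_left(sorted_xs, target)
        return i < len(sorted_xs) and sorted_xs[i] == target

    print(linear_find(data, 9_999_999))  # slow but dead simple
    print(binary_find(data, 9_999_999))  # fast, but needs sorted input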
One of the cool things about cheap hardware is that you can use strategies
like map/reduce that aren't necessarily "efficient" in terms of the number
of gates or instructions executed per task. Dividing up the raw data among
a bunch of independent, relatively cheap/simple/stupid searchers is a
clever idea, and may in the long run be more scalable than trying to
develop O(log N) kinds of algorithms. (Of course, one can argue that
breaking up the data into independent chunks is, in itself, an O(log N)
kind of strategy.)
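A minimal sketch of that divide-and-search idea, with multiprocessing
standing in for the bunch of cheap searchers (the chunk count and match
predicate are made up for illustration):

    from multiprocessing import Pool

    def search_chunk(chunk):
        # map step: each cheap/simple/stupid worker scans its own slice
        return [x for x in chunk if x % 7_919 == 0]  # arbitrary predicate

    if __name__ == "__main__":
        data = list(range(10_000_000))                 # the raw data
        n = 8                                          # assumed worker count
        chunks = [data[i::n] for i in range(n)]        # divide it up
        with Pool(n) as pool:
            partials = pool.map(search_chunk, chunks)  # map
        hits = sorted(x for part in partials for x in part)  # reduce
        print(len(hits))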
You can buy/rent/use a lot of machines for the cost of a work year's
labor; and it's possible that even if the simple approach is slower,
you'll still be done sooner, because you can start working *now* vs *in a
year*.
On 1/24/16, 7:37 AM, "Perry E. Metzger" <perry at piermont.com> wrote:
>On Sat, 23 Jan 2016 15:14:35 +0000 "Lux, Jim (337C)"
><james.p.lux at jpl.nasa.gov> wrote:
>> Dumb sequential search in ram is probably faster than a fancy
>> indexing scheme on disk.
>
>Probably not. Eventually, log(n) vs n still bites you in the end once
>you get to a large enough data set. RAM will let you survive with
>very stupid data structures only up to a point.
>
>Reading sequentially through a couple of terabytes of RAM can take a
>very long time (relatively speaking) compared to a read or two from
>SSD. Especially if you're doing this operation many, many millions
>of times, you'll eventually feel quite serious pain.
>
>The answer, of course, is to use a smart data structure in memory, and
>then you'll blow away even an SSD every time. :)
>
>Perry
>--
>Perry E. Metzger perry at piermont.com