[Beowulf] Slide on big data
Mark Hahn
hahn at mcmaster.ca
Tue Feb 18 22:58:10 PST 2014
> Pardon me, what exactly IS Big Data :)
ya smiley, but I think it's worth trying to put words to it.
mostly, I think BD is really "Many Data": it's not really
about the absolute scale. If I run a big simulation that
writes 10 TB of checkpoints every cycle, that's reasonably
large data. in a sense, I've got just one unit of data per node,
so not really "many". Or if I'm doing lookups in some giant
business DB - the tables may be quite large, but I'm probably
doing low-cardinality selects and joins (indices FTW!).
in a sense, you have BD when your data and performance requirements
control the design of your clusters. you may have a very trad DB that's
implemented across more than one node, but it's probably not
a thousand nodes with gigabit - the latter is probably BD.
I often think of BD and Data Mining as being quite closely linked.
But I don't think I'd want to say that all BD is for DM...
regards, mark hahn.