[Beowulf] High Performance for Large Database

Robert G. Brown rgb at phy.duke.edu
Wed Oct 27 11:54:04 PDT 2004


On Wed, 27 Oct 2004, Mark Hahn wrote:

> > NONtrivial parallelizations are things like distributing the execution
> > of actual SQL search statements across a cluster.  Whether there is any
> 
> it's easy to imagine that a stream of SQL queries could actually 
> be handled in sort of an adaptive data refinement manner, where most
> of the thought goes in to managing division of the query labor (distributed
> indices searched in parallel, etc) , and in placement of data (especially
> ownership/locking of writable data). I have no idea whether Oracle-level DB's
> do this, but it wouldn't surprise me.  the irony is that most of the thought
> that goes into advanced Beowulf applications is doing exactly this sort of 
> labor/data division/balancing.
> 
> I'd hazard a guess that the place to start putting parallelism in a DB
> is the underlying isam-like table layer...

As always, google is your friend. parallel database algorithms turns up
lots of current work; I'm sure a look at specific open source projects
would turn up more (and maybe more relevant) work.

Some of the tools I turned up in my short former query do exploit the
kind of simple read-only data parallelism you described, though, and
wrap it up all pretty.  For small read only databases (backing e.g. a
website), the very simplest approach is likely to put a full copy of the
DB on each server and distribute the transactions themselves round
robin.  Use rsync to periodically update the images to accomodate
distributed changes, if you permit distributed write and you dare
(merging in changes is nontrivial).  Or write an engine that uses idle
time of the node engines themselves to distribute inserts to be
scheduled.

Google itself is a pretty good example, actually.  Complex searches of a
read-only and truly encyclopediac database.

All I really know is that this is real computer science, and I am only
an amateur at best.  I suspect that large scale database parallelism is
the subject of much current algorithmic research, as parts of the
problem are likely NP complete or at least NP hard.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu





More information about the Beowulf mailing list