[Beowulf] High Performance for Large Database

Thu Oct 28 03:03:56 PDT 2004

On Tue, Oct 26, 2004 at 01:08:00PM -0600, Joshua Marsh wrote:
> Hi all,
> 
> I'm currently working on a project that will require fast access to
> data stored in a PostgreSQL database server.  I've been told that a
> Beowulf cluster may help increase performance.  Since I'm not very
> familiar with Beowulf clusters, I was hoping that you might have some
> advice or information on whether a cluster would increase performance
> for a PostgreSQL database.  The major tables accessed are around
> 150-200 million records.  On a standalone server, it can take several
> minutes to perform a simple select query.
> 
> It seems like once we start pricing servers with 16+ processors
> and 64+ GB of RAM, the prices skyrocket.  If I can achieve high
> performance with a cluster, using 15-20 dual processor machines, that
> would be great.

It depends.  I was involved in one project where we had some hosts doing
a *massive* number of queries against PostgreSQL, but few or no updates.

This parallelizes very well.  A single query will not run faster, but
when you run thousands of queries, spreading them across a cluster of
PostgreSQL databases evens out the load nicely, giving you linear
scaling (sustained queries per second versus machines in the
cluster).
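
As a rough illustration of that scaling claim, here is a toy model in
Python.  The function names and the numbers are made up for the example
and are not measurements from the project mentioned above; it assumes a
read-only load spread evenly across identical database machines.

```python
def cluster_qps(machines: int, qps_per_machine: float) -> float:
    """Sustained queries per second for an evenly loaded read-only cluster."""
    return machines * qps_per_machine


def query_latency(single_machine_seconds: float, machines: int) -> float:
    """Latency of one query; extra machines do not split a single query."""
    return single_machine_seconds


if __name__ == "__main__":
    # Hypothetical figures: 50 queries/s per machine, 120 s for one big select.
    for n in (1, 5, 20):
        print(n, cluster_qps(n, 50.0), query_latency(120.0, n))
```

Throughput grows with the machine count while the time for any one
query stays where it was.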

I don't think you'll have any luck finding off-the-shelf
production-quality database software that will parallelize a single
query on a number of nodes.

If you just want throughput, large numbers of queries on a large number
of databases, and you are doing mostly selects with very few (if any)
updates/inserts/deletes, then PostgreSQL comes with software that can
help you mirror your database.

What you do is, you have a 'master' database - you will perform all
updates/deletes/inserts against this master.

The master will relay updates to a number of slave databases.

You perform all selects against the slaves.

Simple, stable, and works perfectly within the limits inherent in such a
setup (e.g. a single query won't parallelize, the master cannot scale to
more updates than what is possible on a single system, etc.)
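
On the application side, the master/slave split described above could
be sketched roughly like this in Python.  This is only a sketch: the
Router class, the host names and the round-robin choice of slave are
assumptions for illustration, not part of PostgreSQL or of any
particular replication tool, and classifying a statement by its first
word is deliberately naive.

```python
import itertools

# Statements that modify data and therefore must go to the master.
WRITE_VERBS = {"insert", "update", "delete"}


class Router:
    """Pick a database host for a statement: writes to master, reads to slaves."""

    def __init__(self, master, slaves):
        if not slaves:
            raise ValueError("at least one slave is required")
        self.master = master
        # Round-robin over the slaves to even out the select load.
        self._slaves = itertools.cycle(list(slaves))

    def pick(self, sql):
        words = sql.split()
        verb = words[0].lower() if words else ""
        if verb in WRITE_VERBS:
            return self.master
        return next(self._slaves)


if __name__ == "__main__":
    # Hypothetical host names.
    router = Router("db-master", ["db-slave1", "db-slave2"])
    print(router.pick("UPDATE t SET x = 1"))  # goes to the master
    print(router.pick("SELECT * FROM t"))     # goes to a slave
```

The returned host would then be used to open (or pick a pooled)
connection.  Note that a select issued right after an update may not
see the change yet if the slave has not caught up with the master.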

-- 

 / jakob