[Beowulf] High Performance for Large Database

Mark Hahn hahn at physics.mcmaster.ca
Wed Oct 27 10:25:58 PDT 2004


> relationships regarding access to the HPC-generated data, a DB is needed
> just to permit search and retrieval of your OWN results, let alone
> somebody else's.

right.  the distinction here is that HPC apps and filesystems tend to have 
a very simple DB schema ;)

> Writing a PARALLEL SQL database server is even MORE nontrivial, and
> while yes, some reasons for this are shared by the HPC community, the
> bulk of them are related directly to locking and the file system and to
> SQL itself.

depends.  for instance, it's not *that* uncommon to have DBs that 
see almost nothing but read-only queries (and updates, if they happen
at all, can be batched during off-hours).  that makes a parallel 
version quite easy, actually: imagine a bunch of 8 GB dual-Opterons
running queries against a simple NFS v3 server over Myrinet.  for a 
read-mostly load, especially one with enough locality to make the 8 GB 
caches effective, this would probably *fly*.  tweak it with iSCSI and 
go to 64 GB quad-Opterons.  how many tables out there wouldn't have a 
good hit rate in 64 GB?
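
to make that concrete, here's a toy dispatcher sketch in python
(the host names and the run_on_node stub are invented; a real
version would hand each query to a local engine on the node):

# toy round-robin dispatcher for a read-mostly query farm.
# hypothetical setup: every node mounts the same tables read-only
# over NFS and runs a local query engine; since nothing writes,
# no locking is needed and each node's big cache stays valid.

import itertools

NODES = ["opteron01", "opteron02", "opteron03", "opteron04"]  # hypothetical hosts

def run_on_node(node, sql):
    # stand-in for shipping the query to the node's local engine;
    # a real version would use a socket or RPC here.
    print("sending to %s: %s" % (node, sql))
    return []

class ReadOnlyFarm:
    """spread read-only queries across identical replicas."""
    def __init__(self, nodes):
        self._next = itertools.cycle(nodes)

    def query(self, sql):
        # any node can answer, since the data is read-only;
        # round-robin keeps the caches evenly loaded.
        return run_on_node(next(self._next), sql)

farm = ReadOnlyFarm(NODES)
farm.query("SELECT * FROM results WHERE run_id = 42")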

> NONtrivial parallelizations are things like distributing the execution
> of actual SQL search statements across a cluster.  Whether there is any

it's easy to imagine that a stream of SQL queries could actually 
be handled in sort of an adaptive data-refinement manner, where most
of the thought goes into managing the division of the query labor
(distributed indices searched in parallel, etc.) and the placement of
data (especially ownership/locking of writable data).  I have no idea
whether Oracle-level DBs do this, but it wouldn't surprise me.  the
irony is that most of the thought that goes into advanced Beowulf
applications is doing exactly this sort of labor/data division/balancing.
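
here's a toy scatter/gather sketch of the "distributed indices
searched in parallel" part (the partitioning and index format are
made up for illustration; on a cluster, each call would be an RPC
to the node owning that partition):

# each partition is a sorted list of (key, row_id) pairs; a range
# query fans the same predicate out to every partition in parallel
# and the coordinator concatenates the results.

from concurrent.futures import ThreadPoolExecutor
import bisect

PARTITIONS = [
    [(1, "r1"), (5, "r2"), (9, "r3")],
    [(12, "r4"), (15, "r5")],
    [(20, "r6"), (22, "r7"), (30, "r8")],
]

def search_partition(part, lo, hi):
    # binary-search one sorted partition for keys in [lo, hi]
    keys = [k for k, _ in part]
    i = bisect.bisect_left(keys, lo)
    j = bisect.bisect_right(keys, hi)
    return [rid for _, rid in part[i:j]]

def range_query(lo, hi):
    # same predicate on every partition, results merged by the caller
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(search_partition, p, lo, hi) for p in PARTITIONS]
        return [rid for f in futures for rid in f.result()]

print(range_query(5, 21))   # -> ['r2', 'r3', 'r4', 'r5', 'r6']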

I'd hazard a guess that the place to start putting parallelism in a DB
is the underlying ISAM-like table layer...
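
e.g. something like this hypothetical sketch (node names and the
storage format are invented): the table layer hash-partitions rows
across nodes, so a point lookup touches exactly one node and a
scan fans out to all of them.

import zlib

class PartitionedTable:
    def __init__(self, nodes):
        self.nodes = nodes
        self.stores = {n: {} for n in nodes}  # per-node key->row maps

    def _home(self, key):
        # deterministic hash placement: every client agrees on which
        # node owns a key, so the read path needs no central lock manager.
        return self.nodes[zlib.crc32(str(key).encode()) % len(self.nodes)]

    def insert(self, key, row):
        self.stores[self._home(key)][key] = row

    def lookup(self, key):
        # a point read goes to exactly one node
        return self.stores[self._home(key)].get(key)

    def scan(self):
        # a full scan would run on all nodes at once
        for node in self.nodes:
            for key, row in self.stores[node].items():
                yield key, row

t = PartitionedTable(["node-a", "node-b"])
t.insert(42, {"energy": -1.5})
print(t.lookup(42))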



