[Beowulf] High Performance for Large Database

Kumaran Rajaram kums at mpi.mpi-softtech.com
Mon Nov 15 08:53:30 PST 2004

   IMHO, the I/O workload in databases is dominated by random, small-block
requests.  To cater to such an I/O pattern, nodes hosting databases tend
to have large caches (RAM) and are SMP-based. The database software
implements proprietary storage/access policies for high performance. In
this sense, databases mostly require a block-device interface from the
storage system rather than a file-system interface. A file-system
interface should also work, although performance-wise you add an
additional layer and are restricted by the file system's storage
policies. The advantage is that a file system aggregates the storage and
provides a single namespace, making it easier to manage and back up the
data.

   In terms of block devices, a SAN provides low latency, high bandwidth,
and high availability, which is ideal for a database environment. For a
moderate price, an iSCSI SAN may be used instead of an FC SAN. A SAN also
makes management of block devices easier. The only caveat is that the
maximum size of a block device is 2TB in the 2.4 kernel. The 2.6 kernel
extends this to 16TB.
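
   As a rough check on those two limits, here is a minimal sketch of the
arithmetic. It assumes the usual explanation, which is not stated above: a
32-bit sector count with 512-byte sectors for 2.4, and a 32-bit page index
with 4 KiB pages on 32-bit systems for 2.6. The constant and function names
are made up for illustration, and the results are in binary units (TiB).

```python
# Assumed parameters (not from the post): 32-bit indices,
# 512-byte sectors, 4 KiB pages.
SECTOR_BYTES = 512
PAGE_BYTES = 4096
INDEX_BITS = 32
TIB = 2**40


def max_device_bytes(unit_bytes, index_bits=INDEX_BITS):
    """Largest size addressable with an index of `index_bits` bits
    when each index value refers to `unit_bytes` bytes."""
    return (2**index_bits) * unit_bytes


# 2.4-style limit: 32-bit sector number, 512-byte sectors.
limit_24 = max_device_bytes(SECTOR_BYTES)
# 2.6-style limit on 32-bit systems: 32-bit page index, 4 KiB pages.
limit_26 = max_device_bytes(PAGE_BYTES)

print(limit_24 // TIB, "TiB")
print(limit_26 // TIB, "TiB")
```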

   PVFS/Lustre are currently tuned for HPC-style applications, which are
dominated by large, contiguous I/O requests, and the file-system striping
policies help to provide higher bandwidth. However, for small requests,
striping may not prove beneficial. Also, most of these file systems use
TCP/IP, hence the network-layer latency can affect database performance.
The MPI-IO interface may be used to optimize non-contiguous, smaller
requests through its datatype and file-view features. Newer PVFS/Lustre
versions offer native implementations for low-latency interconnects like
Myrinet, IB, or Quadrics; however, the stability of the file-system needs to be
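
   To illustrate why striping helps large requests but not small ones,
here is a minimal sketch that maps a byte range onto I/O nodes under a
plain round-robin (RAID-0 style) layout. The function name, the 64 KiB
stripe size, and the 8 I/O nodes are assumptions for illustration only.
They are not PVFS or Lustre defaults, and the real layouts have more
detail than this.

```python
def nodes_touched(offset, length, stripe_size, num_nodes):
    """Return the set of I/O node indices that serve the byte range
    [offset, offset + length) under round-robin striping."""
    if length <= 0:
        return set()
    first_stripe = offset // stripe_size
    last_stripe = (offset + length - 1) // stripe_size
    # Once a request spans num_nodes stripes, every node is involved,
    # so there is no need to walk any further.
    count = min(last_stripe - first_stripe + 1, num_nodes)
    return {(first_stripe + i) % num_nodes for i in range(count)}


KIB = 1024
MIB = 1024 * KIB

# Hypothetical layout: 64 KiB stripes over 8 I/O nodes.
small = nodes_touched(1 * MIB, 4 * KIB, 64 * KIB, 8)   # database-style request
large = nodes_touched(0, 1 * MIB, 64 * KIB, 8)         # HPC-style request

print(len(small), "node(s) serve the 4 KiB request")
print(len(large), "node(s) serve the 1 MiB request")
```

   In this sketch the small request lands on a single I/O node, so it
sees one node's latency and no aggregate bandwidth, while the large
request is spread over all eight.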

   Consistency, integrity, and availability of data cannot be compromised
in databases. Current PVFS/Lustre versions stripe files across their I/O
nodes in a RAID-0 pattern. Going down another level, hardware or
software RAID 1/5 can be performed at the disk level, resulting in the
file system providing RAID 10/50. However, the failure of a single I/O
node might lead to temporary loss (data in cache) or unavailability of
file data until the node is revived. RAID 1/5 across I/O nodes is planned in future
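
   The availability concern can be put into rough numbers. The sketch
below assumes that I/O nodes fail independently and that a striped file is
reachable only while every node holding one of its stripes is up. Both
are simplifying assumptions, and the 99% per-node figure is invented for
illustration rather than measured.

```python
def striped_availability(node_availability, num_nodes):
    """Probability that all `num_nodes` I/O nodes are up at once,
    assuming independent failures (a simplifying assumption)."""
    return node_availability ** num_nodes


# Hypothetical per-node availability of 99%.
for n in (1, 4, 8, 16):
    print(n, "I/O nodes:", round(striped_availability(0.99, n), 4))
```

   Under these assumptions, availability of the striped data drops as
I/O nodes are added, which is why redundancy across I/O nodes, and not
only at the disk level, matters for a database.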

   Price, performance, availability, manageability, and consistency of
file data need to be weighed when architecting the database solution.


On Mon, 15 Nov 2004, Laurence Liew wrote:

> Hi
> The current version of GFS has a 64-node limit, something to do with
> the maximum number of connections through a SAN switch.
> I believe the limit could be removed in RHEL v4.
> BTW, GFS was built for enterprise use and not specifically for HPC. The
> use of a SAN (all nodes need to be connected to a single SAN storage)
> may be a bottleneck.
> I would still prefer the model of PVFS1/2 and Lustre, where the data is
> distributed amongst the compute nodes.
> I suspect GFS could prove useful, however, for enterprise clusters of,
> say, 32 - 128 nodes, where the number of I/O nodes (GFS nodes with
> exported NFS) can be small (fewer than 8 nodes). It could work well.
> Cheers!
> Laurence
> Chris Samuel wrote:
> > On Wed, 10 Nov 2004 12:08 pm, Laurence Liew wrote:
> >
> >
> >>You may wish to try GFS (open sourced by Red Hat after buying
> >>Sistina)... it may give better performance.
> >
> >
> > Anyone here using the GPL'd version of GFS on large clusters ?
> >
> > Be really interested to hear how folks find that..
> >
> >
> >
> > ------------------------------------------------------------------------
> >
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org
> > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> --
> Laurence Liew, CTO		Email: laurence at scalablesystems.com
> Scalable Systems Pte Ltd	Web  : http://www.scalablesystems.com
> (Reg. No: 200310328D)
> 7 Bedok South Road		Tel  : 65 6827 3953
> Singapore 469272		Fax  : 65 6827 3922
