[Beowulf] Question on high performance, low cost Fileserver

Guy Coates gmpc at sanger.ac.uk
Mon Nov 14 01:46:03 PST 2005


On Thu, 10 Nov 2005, ar 3107 wrote:

> We are looking into designing a low cost, high performance storage system.
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^

Unfortunately, low cost, high performance and reliability are mutually
exclusive in the storage world; you might have to make some compromises.

> PVFS is not reliable enough for home dirs (OK for scratch), GPFS cannot do
> RAID5 like striping across nodes, needs SAN for RAID1 like mirroring (cost
> $$$)

You are right that you can't do RAID5-like striping across nodes with GPFS,
but you can do RAID1-like mirroring without a SAN.

You can set up a topology with two NSD servers, each with locally attached
disks.


     Cluster nodes
        |
        |
-------Network-----------
|                       |
|                       |
Server1                 Server2
|                       |
|                       |
Local disks             More local disks.


You then create two NSDs from the two sets of local disks. Put the disks
from server1 and server2 into their own failure groups, then create
a filesystem with data and metadata replicas=2.

You've now got a redundant GPFS filesystem: you can lose server1 or
server2 and keep on going. Obviously you need twice as much disk, but you
don't need Fibre Channel or dual-homed SCSI. And you can't get much cheaper
in storage terms than locally attached disk.

In GPFS-speak, your NSD descriptor file should look like this:

/dev/sda1:server1::dataAndMetadata:1::
/dev/sda1:server2::dataAndMetadata:2::

and your mmcrfs command will be

mmcrfs -m 2 -r 2 -M 2 -R 2 ....
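
For completeness, the full sequence from disk descriptors to mounted
filesystem looks roughly like the following. This is only a sketch: it
assumes the GPFS cluster itself is already up (mmcrcluster/mmstartup), and
the descriptor file name, filesystem device name (gpfs0) and mount point
(/gpfs) are made-up placeholders.

# disk.desc contains the two NSD lines above
mmcrnsd -F disk.desc            # define the NSDs; mmcrnsd rewrites disk.desc
mmcrfs /gpfs gpfs0 -F disk.desc -m 2 -r 2 -M 2 -R 2
mmmount gpfs0 -a                # mount on all nodes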


You could then repeat this pattern with more pairs of servers until you have
enough IO bandwidth for your applications, and/or add more disks to your NSD
servers to give you more space.

To stripe across 4 NSD servers, you'd create a filesystem like this:

/dev/sda1:server1::dataAndMetadata:1::
/dev/sda1:server2::dataAndMetadata:2::
/dev/sda1:server3::dataAndMetadata:3::
/dev/sda1:server4::dataAndMetadata:4::
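
Once the filesystem is created you can sanity-check that each disk ended up
in the failure group you intended (gpfs0 again being a placeholder device
name; the exact output varies between GPFS versions):

mmlsnsd                         # shows which server serves each NSD
mmlsdisk gpfs0                  # shows disk usage, failure group, availability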

Note that GPFS only allows you to create a maximum of 2 data replicas;
that means you can only lose 1 NSD server in each filesystem and keep on
going. So the more NSD servers you add, the greater your IO bandwidth, but
the greater your risk of getting a double failure on the filesystem.
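
When a failed NSD server comes back you have to tell GPFS to bring its disks
back online and re-sync the replicas; from memory (so double-check the man
pages) the recovery looks something like this:

mmlsdisk gpfs0                  # failed disks show availability "down"
mmchdisk gpfs0 start -a         # restart the down disks
mmrestripefs gpfs0 -r           # restore replication on ill-replicated files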

If you want more reliability, then you need dual-attached SCSI or SAN
storage.

> Is GFS or Lustre suitable for the above needs? Any other commercial solution?

GFS has a maximum cluster size of 64 nodes, so you might run into problems
when you expand.

Lustre and GPFS are almost feature-equivalent (although the back-end
architectures are quite different).

Although you can get a working Lustre config without dual-attached SCSI
or SAN disks, it is hard to achieve reliability without them.

Cheers,

Guy

-- 
Dr. Guy Coates,  Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
Tel: +44 (0)1223 834244 x 6925
Fax: +44 (0)1223 494919






