[Beowulf] High performance storage with GbE?

Thu Dec 14 10:20:10 PST 2006

Thanks Bill.  This is really helpful.

On Wed, 13 Dec 2006, Bill Broadley wrote:

> What do you expect the I/O's to look like?  Large file read/writes?  Zillions
> of small reads/writes?  To one file or directory or maybe to a file or
> directory per compute node?

We are basing our specs on large file use.  The cluster is used for many 
things so I'm sure there will be some cases where small file writes will 
be done.  Most of the work I do deals with large file reads and 
writes and that is what we are basing our desired performance on.  I don't 
think we can afford to try to get this type of bandwidth for multiple 
small file writes.

> My approach so far has been to buy N dual opterons with 16 disks in each
> (using the areca or 3ware controllers) and use NFS.  Higher end 48 port
> switches come with 2-4 10G uplinks.  Numerous disk setups these days
> can sustain 800MB/sec (Dell MD-1000 external array, Areca 1261ML, and the
> 3ware 9650SE) all of which can be had in a 15/16 disk configuration for
> $8-$14k depending on the size of your 16 disks (400-500GB towards the lower
> end, 750GB towards the higher end).

Do you have a system like this in place right now?

> NFS would be easy, but any collection of clients (including all) would be
> performance limited by a single server.

This would be a problem, but...

> PVFS2 or Lustre would allow you to use N of the above file servers and
> get not too much less than N times the bandwidth (assuming large sequential
> reads and writes).

... this sounds hopeful.  How manageable is this?  Is it something that 
would take an FTE to keep going with 9 of these systems?  I guess it 
depends on the systems themselves and how much fault tolerance there is.

> In particular the Dell MD-1000 is interesting in that it allows for 2 12Gbit
> connections (via SAS), the docs I've found show you can access all 15
> disks via a single connection or 7 disks on one, and 8 disks on the other.
> I've yet to find out if you can access all 15 disks via both interfaces
> to allow failover in case one of your fileservers dies.  As previously
> mentioned both PVFS2 and Lustre can be configured to handle this situation.
>
> So you could buy a pair of dual opterons + SAS card (with 2 external
> connections) then connect each port to each array (both servers to
> both connections), then if a single server fails the other can take
> over the other server's disks.
>
> A recent quote showed that a config like this (2 servers, 2 arrays) would
> cost around $24k, assuming one spare disk per chassis and a 12+2 RAID6 array,
> and provide 12TB usable (not including 5% for filesystem overhead).

Are the 1 TB drives out now?  With 750 GB drives wouldn't it be 9 TB per 
array?  We have a 13+2 RAID6 + hot spare array with 750 GB drives and with 
an XFS file system we get 8.9 TiB.
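
As a rough check on these capacity figures, here is a minimal sketch of the
arithmetic.  It assumes drive sizes are decimal (750 GB = 750e9 bytes), that
only the data drives of the RAID6 set count toward usable space, and it
ignores filesystem overhead.  The function name is hypothetical.

```python
TIB = 2**40  # bytes in one tebibyte


def raid6_usable_tib(data_drives, drive_gb):
    """Usable capacity of a RAID6 set in TiB, before filesystem overhead.

    data_drives excludes the two parity drives and any hot spare.
    drive_gb is the marketed (decimal) drive size in GB.
    """
    return data_drives * drive_gb * 10**9 / TIB


# 13+2 RAID6 with 750 GB drives (the 16-drive array described above)
print(round(raid6_usable_tib(13, 750), 2))

# 12+2 RAID6 with 750 GB drives (the 15-drive configuration in the quote)
print(round(raid6_usable_tib(12, 750), 2))
```

The first figure comes out close to the 8.9 TiB reported for the existing
array, and the second is 9 TB expressed in TiB.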

> So 9 of the above = $216k and 108TB usable, each of the arrays Dell claims
> can manage 800MB/sec, things don't scale perfectly but I wouldn't be surprised
> to see 3-4GB/sec using PVFS2 or Lustre.  Actual data points appreciated, we
> are interested in a 1.5-2.0GB/sec setup.

Based on 8.9 TiB above for 16 drives, it looks like 8.2 TiB for 15 drives, 
so we'd want 12 of these to get about 98 TiB usable storage. I don't know 
what the overhead is in PVFS2 or Lustre compared to XFS but I'd doubt it 
would be any less so we might even need 13.

So, 13 * $24K = $312K.  Ah, what's another $100K.
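
The array count and cost can be sketched the same way.  This follows the
arithmetic in this message, which prices each unit at the quoted $24K.  The
5% overhead used in the second call is only a placeholder borrowed from the
filesystem-overhead figure in the quote; the real PVFS2 or Lustre overhead is
not known here.  The function name is hypothetical.

```python
import math


def arrays_needed(target_tib, per_array_tib, overhead_fraction=0.0):
    """Number of arrays required to reach target_tib of usable space.

    overhead_fraction is an assumed loss to the (parallel) filesystem.
    """
    effective_tib = per_array_tib * (1 - overhead_fraction)
    return math.ceil(target_tib / effective_tib)


print(arrays_needed(98, 8.2))          # no extra overhead assumed
print(arrays_needed(98, 8.2, 0.05))    # assumed 5% overhead

COST_PER_UNIT = 24_000                 # dollars, from the quoted price
print(13 * COST_PER_UNIT)
```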

> Are any of the solutions you are considering cheaper than this?  Any of the
> dual opterons in a 16 disk chassis could manage the same bandwidth (both 3ware
> and areca claim 800MB/sec or so), but could not survive a file server death.

So far this is the best price for something that can theoretically give 
the desired performance.  I say theoretically here because I'm not sure 
what parts of this you have in place.  I'm trying to find real-world 
implementations that provide in the ballpark of 5 to 10 MB/sec at the 
nodes when on the order of a hundred nodes are writing/reading at the 
same time.
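
For scale, a minimal sketch of what that per-node target implies in
aggregate.  It assumes exactly 100 nodes active at once and perfect sharing
of bandwidth, neither of which a real system will deliver.  The function
names are hypothetical.

```python
def aggregate_mb_per_s(nodes, per_node_mb_per_s):
    """Total bandwidth the storage must sustain for a per-node target."""
    return nodes * per_node_mb_per_s


def per_node_mb_per_s(nodes, aggregate):
    """Per-node share of a given aggregate bandwidth, assuming even sharing."""
    return aggregate / nodes


NODES = 100  # "on the order of a hundred nodes"

# Aggregate needed for the 5 to 10 MB/sec per-node target
print(aggregate_mb_per_s(NODES, 5), aggregate_mb_per_s(NODES, 10))

# Per-node share if the 3-4GB/sec estimate in the quote were achieved
print(per_node_mb_per_s(NODES, 3000), per_node_mb_per_s(NODES, 4000))
```

The per-node target works out to 500 to 1000 MB/sec in aggregate, which is
why a single server claiming about 800 MB/sec is borderline for this load.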

Are you using PVFS2 or Lustre with your N Opteron servers?  When you run a 
job with many nodes writing large files at the same time what kind of 
performance do you get per node?  What is your value of N for the number 
of Opteron server/disk arrays you have implemented?

Thanks again for all of this information.  I hadn't been thinking 
seriously of PVFS2 or Lustre because I'd been thinking more along the lines 
of individual disks in nodes.  Using RAID arrays would be much more 
manageable.  Are there others who have this type of system implemented who 
can provide performance results as well as a view on how manageable it is?

Thanks,

Steve