[Beowulf] Storage - housing 100TB

Alvin Oga alvin at Mail.Linux-Consulting.com
Thu Oct 7 00:42:35 PDT 2004

hi ya robert

our solution is scalable using off the shelf commodity parts
and open source software

- we also recommend a duplicate system for "live backups"

- we can customize our products ( hardware solutions ) to fit the client's
  requirements and budget

- example large 100TB disk-subsystem
	on 4 disks per blade ........ 1.2TB per blade with 300GB disks
	10 blades per 4U chassis .... 12TB per 4U chassis
	10 4U chassis per rack ...... 120TB per 42U rack

	- model shown holds 4 disks, but we can fit 8-disks in it

	- cooling ( front to back or top to bottom ) is our main concern,
	  which we try to solve with one solution
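the capacity figures above can be checked with quick back-of-the-envelope
arithmetic ( disk size and counts taken from the example ):

```python
# Back-of-the-envelope check of the example capacities above.
disk_gb = 300            # GB per SATA disk (from the example)
disks_per_blade = 4
blades_per_chassis = 10
chassis_per_rack = 10

blade_gb = disk_gb * disks_per_blade          # 1200 GB = 1.2 TB per blade
chassis_gb = blade_gb * blades_per_chassis    # 12000 GB = 12 TB per 4U chassis
rack_gb = chassis_gb * chassis_per_rack       # 120000 GB = 120 TB per 42U rack

print(blade_gb, chassis_gb, rack_gb)
```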

- system runs on +12V dc input
	- 2x 600W 2U power supplies are enough to drive the system

- i'd be more than happy to send a demo chassis and blades, no charge,
  if we can get feedback once you've used it and built it out
	- hopefully you can provide the disks, motherboard, cpu, and memory

	- we can provide the "system assembly and testing time" at 
	"evaluation" costs  ( all fees credited toward the purchase )

	- you keep the 4U chassis afterward ( no charge )

On Wed, 6 Oct 2004, Robert G. Brown wrote:

> I'm helping assemble a grant proposal that involves a grid-style cluster
> with very large scale storage requirements.  Specifically, it needs to
> be able to scale into the 100's of TB in "central disk store" (whatever
> that means:-) in addition to commensurate amounts of tape backup.  The

good .. sounds like fun

> tape backup is relatively straightforward -- there is a 100 TB library
> available to the project already that will hold 200 TB after an
> LTO1->LTO2 upgrade, and while tapes aren't exactly cheap, they are
> vastly cheaper than disk in these quantities.

- tape backups are not cheap ...
- tape backups are not reliable ( writing the tapes and restoring from them )
	- dirty heads, tapes that need to be swapped, ..
- tape backups are too slow ( to restore )

> The disk is a real problem.  Raw disk these days is less than $1/GB for
> SATA in 200-300 GB sizes, a bit more for 400 GB sizes, so a TB of disk
> per se costs in the ballpark of $1000.

yup.. good ball park

>  However, HOUSING the disk in
> reliable (dual power, hot swap) enclosures is not cheap, adding RAID is
> not cheap,

it can be ...

does it need to be dual-hot-swap power supplies ??
	- no problem... we can provide that (though not a pretty "case" )

raid is cheap ... but why use raid ... there is no benefit to using
software or hardware raid at this data size ...

	- time is better spent in optimizing data and backup of
	the data to a 2nd system

	- it is NOT trivial to backup 20TB - 100TB of data

	- raid'ing reduces the overall reliability ( more things to
	fail ) and increases the system admin costs ( more testing )
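the "duplicate disk-system" approach above boils down to mirroring each
blade to the second system. a minimal sketch of how that might look with
rsync ( the hostname and paths are hypothetical examples, not anything
from the actual product ):

```python
# Sketch of mirroring one blade's data to the duplicate "live backup"
# system using rsync.  Hostname and paths are hypothetical.
def mirror_cmd(src_dir, backup_host, dest_dir):
    """Build an rsync command that mirrors src_dir to the backup host."""
    return [
        "rsync",
        "-a",          # archive mode: permissions, times, symlinks
        "--delete",    # keep the mirror exact (drop files removed upstream)
        src_dir + "/",
        f"{backup_host}:{dest_dir}/",
    ]

cmd = mirror_cmd("/data/blade01", "backup-node", "/data/blade01")
# on a real system: subprocess.run(cmd, check=True)
print(" ".join(cmd))
```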

> and building a scalable arrangement of servers to provide
> access with some controllable degree of latency and bandwidth for access
> is also not cheap.

not sure what the issues are .. 
- it'd depend on the switch/hub, and "disk subsystem/infrastructure"

>  Management requirements include 3 year onsite
> service for the primary server array -- same day for critical
> components,

we'd be using a duplicate "hot swap backup system"

>  next day at the latest for e.g. disks or power supplies that
> we can shelve and deal with ourselves in the short run. 

most everything we use is off the shelf and can be kept on the shelf
for emergencies

power supplies, disks, motherboards, cpu, memory, fans

> The solution we
> adopt will also need to be scalable as far as administration is
> concerned --

scaling is easy in our case ...

> we are not interested in "DIY" solutions where we just buy
> an enclosure and hang it on an over the counter server and run MD raid,

we can build and test for you ( onsite if needed )

> not because this isn't reliable and workable for a departmental or even
> a cluster RAID in the 1-8 TB range (a couple of servers) it isn't at all
> clear how it will scale to the 10-80 TB range, when 10's of servers
> would be required.

we don't forecast any issues with sw raid ...
	on 4 disks per blade ........ 1.2TB per blade with 300GB disks
	10 blades per 4U chassis .... 12TB per 4U chassis
	10 4U chassis per rack ...... 120TB per 42U rack

> Management of the actual spaces thus provided is not trivial 

managing the actual data to be saved would be a bigger issue than
writing it onto the disk subsystems

> -- there
> are certain TB-scale limits in linux to cope with (likely to soon be
> resolved if they aren't already in the latest kernels, but there in many
> of the working versions of linux still in use) and with an array of

per-file size limits would cap the raw data one can save in a single file

one way around that is to use custom device drivers, the way oracle
uses its own "raw data" drivers to get around file size limitations
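another generic workaround ( distinct from oracle's raw-device approach )
is to split large datasets into fixed-size chunk files so no single file
hits the filesystem's limit. a minimal sketch, with the 2 GB limit chosen
only as an illustration of old 32-bit filesystem limits:

```python
# Minimal sketch: plan fixed-size chunk files so no single file
# exceeds a per-file size limit.  The 2 GB figure is illustrative.
CHUNK_LIMIT = 2 * 1024**3   # 2 GB per chunk

def chunk_plan(total_bytes, limit=CHUNK_LIMIT):
    """Return (number_of_chunks, size_of_last_chunk)."""
    full, last = divmod(total_bytes, limit)
    return (full + 1, last) if last else (full, limit)

n, last = chunk_plan(5 * 1024**3)   # a 5 GB dataset
print(n, last)                      # 3 chunks; the last holds 1 GB
```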

> partitions and servers to deal with, just being able to index, store and
> retrieve files generated by the compute component of the grid will be a
> major issue.

that depends on how the data is created and stored
	- we don't see it as a major issue, as long as each "TB-sized
	file" can be indexed properly at the time of its creation
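"indexed at creation time" could be as simple as recording, for every large
output file, which host and path it landed on. a hypothetical minimal schema
( name -> host, path, size ), not anything from the actual product:

```python
# Sketch: index each large output file at creation time so it can be
# located later across many servers/partitions.  Schema is hypothetical.
index = {}

def register(name, host, path, size_bytes):
    """Record where a freshly written data file lives."""
    index[name] = {"host": host, "path": path, "size": size_bytes}

def locate(name):
    """Look up a file by name; returns None if never registered."""
    return index.get(name)

register("mc_run_0001.dat", "blade03", "/data/mc/run0001", 850 * 10**9)
print(locate("mc_run_0001.dat")["host"])
```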

> SO, what I want to know is:
>   a) What are listvolken who have 10+ TB requirements doing to satisfy
> them?

we prefer non-raided systems ... and duplicate disk-systems for backup

>   b) What did their solution(s) cost, both to set up as a base system
> (in the case of e.g. a network appliance) and

raw components cost roughly $25K per 12TB in one 4U chassis

	- add marketing/sales/admin/contract/onsite costs to that
	( $250K for fully managed 3yr contracts w/ a 2nd backup system )
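those figures work out to a simple cost-per-TB number ( raw components
only; the managed-contract price covers more than one chassis' worth of
service, so it isn't directly comparable per TB ):

```python
# Quick cost-per-TB arithmetic from the figures above.
raw_cost = 25_000     # USD, raw components for one 4U chassis
chassis_tb = 12       # TB per 4U chassis

raw_per_tb = raw_cost / chassis_tb
print(round(raw_per_tb))   # roughly $2083 per TB, raw parts only
```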

>   c) incremental costs (e.g. filled racks)?

the system is expandable as needed per 1.2TB blade or 12TB ( 4U chassis )

the additional cost to install more blades into the disk-subsystem
is incremental: just the time needed to add their config to the existing
config files for the disk subsystem ( fairly simple, since the rest of
the system is already tested and operational )

>   d) How does their solution scale, both costwise (partly answered in b
> and c) and in terms of management and performance?

partly answered above

scaling is accomplished with modular blades and blade chassis

>   e) What software tools are required to make their solution work, and
> are they open source or proprietary?

just the standard linux software raid tools in the kernel 

everything is open source
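
"standard linux software raid tools" here means mdadm. a sketch of how one
blade's four disks might be grouped into a software RAID5 array ( device
names are hypothetical; this builds the command rather than running it ):

```python
# Sketch: assemble the mdadm command that would create a 4-disk
# software RAID5 array from one blade's disks.  Device names are
# hypothetical; the real command must run as root.
def mdadm_create(md_dev, level, disks):
    """Build the mdadm command to create a software RAID array."""
    return [
        "mdadm", "--create", md_dev,
        "--level", str(level),
        "--raid-devices", str(len(disks)),
    ] + disks

cmd = mdadm_create("/dev/md0", 5,
                   ["/dev/sda1", "/dev/sdb1", "/dev/sdc1", "/dev/sdd1"])
print(" ".join(cmd))
```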
>   f) Along the same lines, to what extent is the hardware base of their
> solution commodity (defined here as having a choice of multiple vendors

everything is off-the-shelf

we have a proprietary 4U blade chassis for holding the blades in place,
along with the power supply
	( the system can be changed per customer requirements )

> for a component at a point of standardized attachment such as a fiber
> channel port or SCSI port)

fiber channel cards may be used if needed
	- fiber channel PCI cards are expensive and it is unclear
	if they're required or not

> or proprietary (defined as if you buy this
> solution THIS part will always need to be purchased from the original
> vendor at a price "above market" as the solution is scaled up).

everything is off-the-shelf

> Rules:  Vendors reply directly to me only, not the list.

i was wondering why nobody replied publicly :-)

> I'm in the
> market for this, most of the list is not.  Note also that I've already
> gotten a decent picture of at least two or three solutions offered by
> tier 1 cluster vendors or dedicated network storage vendors although I'm
> happy to get more.

i hope "name brand" is not the primary evaluation consideration

> However, I think that beowulf administrators, engineers, and users
> should likely answer on list as the real-world experiences are likely to
> be of interest to lots of people and therefore would be of value in the
> archives.  I'm hoping that some of you bioinformatics people have
> experience here, as well as maybe even people like movie makers.

we've been indirectly selling small systems to the movie industry
( by the hundreds of systems ) .. it's just a simple mpeg player :-)

> FWIW, the actual application is likely to be Monte Carlo used to
> generate huge data sets (per node) and cook them down to smaller (but
> still multiGB) data sets, and hand them back to the central disk store
> for aggregation and indexed/retrievable intermediate term storage, with

good ...

> migration to the tape store on some as yet undetermined criterion for
> frequency of access and so forth.  Other uses will likely emerge, but

i'd avoid tape storage due to costs and index/restore/uptime issues

> this is what we know for now.  I'd guess that bioinformatics and movie
> generation (especially the latter) are VERY similar in the actual data
> flow component and also require multiTB central stores and am hoping
> that you have useful information to share.

have fun
