[Beowulf] Storage

Thu Oct 7 16:59:59 PDT 2004

On Thu, 7 Oct 2004, Mark Hahn wrote:

> > that means:-) in addition to commensurate amounts of tape backup.  The
> 
> ick!  our big-storage plans very, very much hope to eliminate tape.
> 
> > tape backup is relatively straightforward -- there is a 100 TB library
> > available to the project already that will hold 200 TB after an
> > LTO1->LTO2 upgrade, and while tapes aren't exactly cheap, they are
> > vastly cheaper than disk in these quantities.
> 
> hmm, LTO2 is $0.25/GB; disks are about double that.  considering the 
> issues of tape reliability, access time and migration, I think 
> disk is worth it.  from what I hear in the storage industry, this 
> is a growing consensus among, for instance, hospitals - they don't 
> want to spend their time reading tapes to see whether the media is 
> failing and content needs to be migrated.  migrating content that's 
> online is ah, easier.  in the $ world, online data is attractive in part
> so its lifetime can be more explicitly managed (ie, deleted!)

It isn't the media, it's the way it is served.  Tape is in the ballpark of
$250/TB, but once you've invested in a general shell -- a tape library
of whatever size you want to pay for -- cost scales linearly, and it
(tape) is easy and relatively safe to transport.  Disk, by the time you
wrap it up, serve it, connect it to this and that, and provide it with
this and that costs much more.  Otherwise I agree with most of what you
say, but remember, I didn't write the RFP specs.

Besides, today they decided to drop the 60 TB of tape spec.  Oops!

We'll still meet or exceed it anyway, as we have a big tape library
that is conveniently underutilized and handy, so we REALLY just pay for
the media (plus maybe kick in a drive or two).

> > The disk is a real problem.  Raw disk these days is less than $1/GB for
> > SATA in 200-300 GB sizes, a bit more for 400 GB sizes, so a TB of disk
> > per se costs in the ballpark of $1000.  However, HOUSING the disk in
> > reliable (dual power, hot swap) enclosures is not cheap, adding RAID is
> > not cheap, and building a scalable arrangement of servers to provide
> > access with some controllable degree of latency and bandwidth for access
> > is also not cheap.
> 
> no insult intended, but have you looked closely, recently?  I did some 
> quick web-pricing this weekend, and concluded:
> 
> vendor          capacity (GB)   size    $Cad list per TB        density (TB/U)
> dell/emc        12x250          3U      $7500                   1.0
> apple           14x250          3U      $4000                   1.166
> hp/msa1500cs    12x250x4        10U     $3850                   1.2
> 
> (divide $Cad by 1.25 or so to get $US.)  all three plug into FC.
> the HP goes up to 8 shelves per controller or 24 TB per FC port, though.

So you add FC switch and server(s) and end up at a minimum of around
$5K/TB.  The maximum prices I'm seeing reported by respondents and that
we've seen in quotes or prices of actual systems are well over $10K/TB,
some as high as $30K/TB.  Price depends on how fast and scalable you
want it to be, which in turn depends on how proprietary it is.  But I'll
summarize all of this when I get through the proposal and can breathe
again.
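
As a quick check on the quoted table, here is a minimal Python sketch
that converts the list prices to US dollars using the rough 1.25
exchange rate given above and recomputes the density column.  The
figures are copied from the quoted table; the function and variable
names are only illustrative, and the results cover the enclosures
alone, not the FC switch and servers discussed above.

```python
# Quoted table: vendor -> (disks, GB per disk, rack units, $Cad list per TB)
SHELVES = {
    "dell/emc":     (12, 250, 3, 7500),
    "apple":        (14, 250, 3, 4000),
    "hp/msa1500cs": (12 * 4, 250, 10, 3850),  # 12x250x4 in 10U
}
CAD_PER_USD = 1.25  # "1.25 or so", per the quoted text


def usd_per_tb(cad_per_tb, rate=CAD_PER_USD):
    """Convert a $Cad per TB list price to $US per TB."""
    return cad_per_tb / rate


def tb_per_u(disks, gb_per_disk, rack_units):
    """Raw capacity per rack unit, using 1 TB = 1000 GB."""
    return disks * gb_per_disk / 1000 / rack_units


usd = {v: usd_per_tb(s[3]) for v, s in SHELVES.items()}
density = {v: tb_per_u(s[0], s[1], s[2]) for v, s in SHELVES.items()}

for vendor in SHELVES:
    print(f"{vendor:14s} ~${usd[vendor]:,.0f} US/TB  {density[vendor]:.2f} TB/U")
```
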

The cheapest solutions are those you build yourself, BTW -- as one might
expect -- followed by ones that a vendor assembles for you, followed in
order by proprietary/named solutions that require special software or
special software and special hardware.  Some of the solutions out there
use basically "no" commodity parts that you can replace through anybody
but the vendor -- they even wrap up the disks themselves in their own
custom packaging and firmware and double the price in the process.

> > Management requirements include 3 year onsite
> > service for the primary server array -- same day for critical
> > components, next day at the latest for e.g. disks or power supplies that
> > we can shelve and deal with ourselves in the short run.  The solution we
> 
> pretty standard policies.
> 
> > adopt will also need to be scalable as far as administration is
> > concerned -- we are not interested in "DIY" solutions where we just buy
> > an enclosure and hang it on an over the counter server and run MD raid,
> > not because this isn't reliable and workable for a departmental or even
> > a cluster RAID in the 1-8 TB range (a couple of servers), but because it
> > isn't at all clear how it will scale to the 10-80 TB range, when 10's of
> > servers would be required.
> 
> Robert, are you claiming that 10's of servers are unmanageable
> on a *cluster* mailing list!?!  or are you thinking of the number
> of moving parts?

I'm thinking of scalability of management at all levels, and performance
at all levels.  I don't >>think<< that I'm crazy in thinking that this
is an issue in large scale storage design -- at least one respondent so
far suggested that I wasn't radical enough and that off-the-shelf or
homemade SAN solutions are doomed to nasty failure at very large (100+
TB) sizes.  I'm not certain that I believe him (I had several people
describe their off-the-shelf solutions that scale to 100+ TB sizes, and
was directed to e.g. http://www.archive.org/web/petabox.php) but think
of me as being hypercautious in my already admitted ignorance;-)

That is, if there are no issues and people are running stacks of 6.4 TB
enclosures hanging off of OTC linux boxes and managing the volumes and
data issues transparently and they scale to 100's of TB, sure, I'd love
to hear about it.  Now I have, although there are issues, there are
issues.  As I said, I'll summarize (and maybe start some lovely
arguments:-) when I'm done but I'm still DIGESTING all the data I've
gotten from vendors and list-friends (all of whom I profoundly thank!).

> > Management of the actual spaces thus provided is not trivial -- there
> > are certain TB-scale limits in linux to cope with (likely to soon be
> > resolved if they aren't already in the latest kernels, but there in many
> > of the working versions of linux still in use) and with an array of
> 
> I can understand and even empathize with some people's desire to
> stick to old and well-understood kernels.  but big storage is a very
> good reason to kick them out of this complacency - the old kernels are
> justifiable only on not-broke-don't-fix grounds...

Again, agreed, but one wants to be very conservative in a project
proposal, especially when we HAVE NO CHOICE as to the actual kernel or
OS distribution -- we will have to just "install the grid" with a
package developed elsewhere by people that you or I might or might not
agree with.  Historically, in fact, I think that there is no chance that
either one of us would do things the way they have done them so far, and
maybe we will ultimately influence the design, but when writing the
proposal we have to assume that we'll be using their linux.  Where at
least we've talked them up from some -- shall we say old? obsolete?
non-x64 supporting? versions of linux and the associated kernels and
libraries as a base... (you get the idea).

> > partitions and servers to deal with, just being able to index, store and
> > retrieve files generated by the compute component of the grid will be a
> > major issue.
> 
> how so?  I find that people still use sensible hierarchical organization,
> even if the files are larger and more numerous than in the past.

It's a grid, and we're trying to avoid direct NFS mounts on all the
nodes for a variety of reasons (like performance, reliability, security)
and because in this kind of grid people will need to use fully automated
schema for data storage, retrieval, and archival migration on and off
the main data store.

Honestly, I personally think that the data management issue and toolset
is MORE important than the hardware.  As you note, we can build arrays
of disk servers or arrays of disk and associated servers or network
appliances and arrays of disk a variety of ways, including DIY with a
fairly obvious design.  In order for people to be able to direct a node
to run for a week and drop its results, properly indexed and
crossreferenced by user/group/program/parameters in a database,
somewhere into the data store where it will be transparently migrated
onto and off of an attached tape archive as needed AND possibly resync'd
back to a project CENTRAL store AND possibly sync'd back to the home LAN
and store of the grid user for local processing --- it is doable, sure,
but I wouldn't call it trivial or necessarily doable without some
hacking or involvement in OS projects addressing this issue or purchase
of proprietary software ditto.

If it is trivial, and there is a simple package that does all this and
eats your meatloaf for you out to a PB of data, please enlighten
me...:-)

> >   a) What are listvolken who have 10+ TB requirements doing to satisfy
> > them?
> 
> we're acquiring somewhere between .2 and 2 PB, and are planning machine rooms
> around the obvious kinds of building blocks: lots of servers that are in 
> the say 4-20 TB range, preferably connected by some fast fabric (IB seems
> attractive, since it's got mediocre latency but good bandwidth.)

Ya.

> >   b) What did their solution(s) cost, both to set up as a base system
> > (in the case of e.g. a network appliance) and
> 
> I'm fairly certain that if I were making all the decisions here, I'd 
> go for fairly smallish modular servers plugged into IB.

Any idea of what that would cost?

> 
> >   c) incremental costs (e.g. filled racks)?

I meant "cost of additional filled disk enclosures" once you've bought
in.  Some solutions involve network appliances with a large capital
investment before you buy your first disk enclosure, and then scale
linearly with filled enclosures to some point, where you buy another
appliance.  Some solutions already specify an appliance interconnect so
that the whole thing is transparent to your cluster.  Some solutions are
expensive, expensive.

I'm just trying to figure out HOW expensive, and how far we can go for
what we can afford with the different alternatives.  I'm happy for
anyone to tell me the virtues of the expensive systems (the benefits) as
long as I have the costs in hand as well, so I can ultimately do the old
fashioned CBA.
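
The "capital investment, then linear in filled enclosures, then another
appliance" pattern described above can be sketched as a simple step
function.  This is only an illustration: every dollar figure and
capacity below is a made-up placeholder, not a quote from any vendor,
and the function name and parameters are hypothetical.

```python
import math


def total_cost(tb_needed, appliance_cost, enclosure_cost,
               tb_per_enclosure, enclosures_per_appliance):
    """Cost of appliances plus filled enclosures for a given capacity.

    Enclosures are bought whole, and a new appliance is needed each time
    the previous one has all of its enclosure slots filled.
    """
    enclosures = math.ceil(tb_needed / tb_per_enclosure)
    appliances = math.ceil(enclosures / enclosures_per_appliance)
    return appliances * appliance_cost + enclosures * enclosure_cost


# Placeholder numbers, chosen only to show the step in the curve.
EXAMPLE = dict(appliance_cost=50_000, enclosure_cost=20_000,
               tb_per_enclosure=4, enclosures_per_appliance=8)

costs = {tb: total_cost(tb, **EXAMPLE) for tb in (10, 32, 36)}
for tb, cost in costs.items():
    print(f"{tb:3d} TB: ${cost:,} total, ${cost / tb:,.0f}/TB")
```

With these placeholder inputs the cost per TB falls as the first
appliance fills and jumps when the second one has to be bought, which
is the shape to look for when comparing real quotes.
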

> >   d) How does their solution scale, both costwise (partly answered in b
> > and c) and in terms of management and performance?
> 
> my only real concern with management is MTBF: if we had a hypothetical 
> 2PB collection of 250G SATA disks with 1Mhour MTBF, we'd go 5 days between
> disk replacements.  to me, this motivates toward designs that have fairly
> large numbers of disks that can share a hot spare (or maybe raid6?)

Right, but if your hypothetical array of disks also involved a stack of
over the counter servers, network switches (of any sort, eg IB or FC or
GE), and so on, there isn't just the disk to worry about -- in fact, in
a good RAID enclosure it is relatively straightforward to deal with the
disk (and hot swap power and hot swap fan) failures.  Dealing with
intrinsic server failures, e.g. toasted memory, CPU, CPU fans, CPU power
supply (maybe, unless server has dual power) and sundry networking or
peripheral card failures takes a lot more time and expertise, and can
take down whole blocks of disks if the disk is provided only via direct
connections to specific servers.
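
The quoted "5 days between disk replacements" figure can be reproduced
with a few lines of Python.  This is a rough sketch that assumes
independent disks with a constant failure rate, so the expected time
between failures across the array is the per-disk MTBF divided by the
number of disks.  It covers disks only and none of the server or
network failures just described; the function name is illustrative.

```python
PB = 10**15
GB = 10**9


def days_between_failures(total_bytes, disk_bytes, mtbf_hours):
    """Expected days between disk failures across the whole array."""
    n_disks = total_bytes / disk_bytes
    return mtbf_hours / n_disks / 24


n_disks = 2 * PB // (250 * GB)                            # 8000 disks
days = days_between_failures(2 * PB, 250 * GB, 1_000_000)  # 125 hours
print(f"{n_disks} disks, one failure roughly every {days:.1f} days")
```
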

Both human effort and expertise required and projected downtime depend a
lot on how you build and set things up.  Or rather, I >>expect<< it to,
and am seeking war-stories (stories of profound failures where some
design was FUBAR and ultimately abandoned for cause, especially) so I
can figure out which designs to avoid because they DON'T scale in
management.

Performance scaling is also important, but we're not looking for the
fastest possible solution or truly superior performance scaling (the
kinds of solutions that cost the $10K+/TB sorts of prices).  Unless of
course all the other solutions simply choke to death at some e.g. 80 TB
scale.  I don't "expect" them to, sure, but if I knew the answer, why'd
I ask?

> 
> >   e) What software tools are required to make their solution work, and
> > are they open source or proprietary?
> 
> I'd be interested in knowing what the problem is that you're asking to be
> solved.  just that you don't want to run "find / -name whatever" on 
> a filesystem of 20 TB?  or that you don't want 10 separate 2TB filesystems?

Partially described above.  The dataflow we are expecting isn't unique
to our problem, BTW.  One respondent with almost exactly the same needs
described a tool they are developing that is designed fairly
specifically to manage the dataflow and archival/migration issues
transparently.  I'm waiting to hear whether it interfaces with any sort
of indexing schema or toolset -- if so, it would simply solve the
problem.  Solve it for the cheapest possible (hardware reliable, COTS
component) data stack -- a pile of OTS multiTB servers -- as well!

In case the above wasn't clear, think:

a) Run 1 day to 1 week, generate some 100+ GB per CPU on node local
storage;

b) Run hours to days, reduce the data to "interesting" and compressed
form, occupying maybe 10% of this space.  How the actual data is
originally created (one big file or many little files, e.g.) I haven't a
clue yet.  How it is aggregated ditto.  At some point, though;

c) condensed data (be it in one 10 GB file or 10 1 GB files or larger or
smaller fragments) is sent in to the central store, where it has to be
saved in a way that is transparent to the user, indexed by the
generating program, its parameters, the generating/owning group, various
node and timestamp metadata, all in a DB that is searchable by the large
community that wants to SHARE this data.  So "find" is clearly out, even
find with really long filenames.  Find is REALLY out if you think about
its performance scaling as you fill the store with lots of inodes.

d) Once on the central store, the data has to be able to stay there (if
it is being used), be backed up to tape (regardless), be MIGRATED to
tape to free space on the central store for other data that IS being
used, be retrievable from backup or archive, be downloadable by the
generating user to a home faraway for local processing, be downloadable
by OTHER groups/users to THEIR homes faraway, and be uploadable to a
PB-scale toplevel store and centralized archive in a higher tier of the
grid.

e) and maybe other stuff.  The RFP wasn't horribly detailed (it wasn't
at ALL detailed) and the material we've obtained from grid prototype
sites isn't very helpful at the design phase.  So we may NEED to export
NFS space to the nodes or use XFS and some fancy toolsets or the like,
but we're hoping to avoid this if the actual workflow permits it.  On a
grid, it "should", since grid tasks should all use "grid functions" to
accomplish macroscopic tasks, not Unix/linux/posix functions or tools.
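
To make the indexing requirement in (c) and the disk/tape state in (d)
concrete, here is a minimal sketch using Python's standard sqlite3
module.  It is not the tool mentioned above or any existing grid
package: the table layout, column names, and example paths are all
assumptions for illustration, and it only records metadata.  Actually
moving data to and from tape is outside its scope.

```python
import sqlite3

# Hypothetical schema: one row per result file sent to the central store.
SCHEMA = """
CREATE TABLE IF NOT EXISTS results (
    id       INTEGER PRIMARY KEY,
    program  TEXT NOT NULL,   -- generating program
    params   TEXT NOT NULL,   -- its parameters, stored as a string
    grp      TEXT NOT NULL,   -- generating/owning group
    node     TEXT NOT NULL,   -- node metadata
    created  TEXT NOT NULL,   -- ISO 8601 timestamp
    location TEXT NOT NULL,   -- 'disk' or 'tape'
    path     TEXT NOT NULL    -- where the file lives in the store
);
CREATE INDEX IF NOT EXISTS results_by_program_group
    ON results (program, grp);
"""


def open_index(db_path=":memory:"):
    """Open (or create) the index database."""
    db = sqlite3.connect(db_path)
    db.executescript(SCHEMA)
    return db


def register(db, program, params, grp, node, created, path):
    """Record a newly arrived result file; new files start on disk."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT INTO results "
            "(program, params, grp, node, created, location, path) "
            "VALUES (?, ?, ?, ?, ?, 'disk', ?)",
            (program, params, grp, node, created, path),
        )


def migrate_to_tape(db, path):
    """Mark a file as migrated off the central store onto tape."""
    with db:
        db.execute("UPDATE results SET location = 'tape' WHERE path = ?",
                   (path,))


def search(db, program, grp):
    """Find files by program and group without walking the filesystem."""
    rows = db.execute(
        "SELECT path, location FROM results "
        "WHERE program = ? AND grp = ? ORDER BY created",
        (program, grp),
    )
    return rows.fetchall()


db = open_index()
register(db, "sim", "L=16,beta=0.44", "physics", "node012",
         "2004-10-07T16:00:00", "/store/physics/sim/run1.dat")
register(db, "sim", "L=32,beta=0.44", "physics", "node013",
         "2004-10-08T09:30:00", "/store/physics/sim/run2.dat")
migrate_to_tape(db, "/store/physics/sim/run1.dat")
hits = search(db, "sim", "physics")
print(hits)
```

The point of the sketch is that lookups go through an indexed query
rather than "find", so their cost does not grow with the number of
inodes in the store.
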

> >   f) Along the same lines, to what extent is the hardware base of their
> > solution commodity (defined here as having a choice of multiple vendors
> > for a component at a point of standardized attachment such as a fiber
> > channel port or SCSI port) or proprietary (defined as if you buy this
> > solution THIS part will always need to be purchased from the original
> > vendor at a price "above market" as the solution is scaled up).
> 
> as far as I can see, the big vendors are somehow oblivious of the fact
> that customers *HATE* the proprietary, single-source attitude.  
> 	oh, you can plug any FC devices you want into your san, 
> 	as long as they're all our products and we've "qualified" them.

You are now the third or fourth person to make THAT observation.

  "Standards?  We don' care about no stinkin' standards..." (apologies
  to Mel Brooks and Blazing Saddles...;-)

> 
> > Rules:  Vendors reply directly to me only, not the list.  I'm in the
> > market for this, most of the list is not.  Note also that I've already
> 
> I think you'd be surprised at how many, many people are buying 
> multi-TB systems for isolated labs.  there are good reasons that 
> this kind of scattershot approach is not wise in, say, a university
> setting, where a shared resource pool can respond better to burstiness,
> consistent maintenance, stable environment, etc.

I agree again.  Hell, I maintain a 3x80 GB disk IDE RAID in my HOME
server these days, and the only thing special about the "80" is the age
of the disks -- next time I upgrade it I'll likely make it close to a TB
just because I can.  So TB-scale storage is to be expected in most
departmental size computing efforts at $1/GB plus housing and server.

100 TB-scale storage is a different beast.  One is really engineering a
storage "cluster" and like all cluster engineering, the optimal result
depends on the application mix and expected usage; a "recipe" based
solution might work or it might lead to disaster and effectively
unusable resources due to bottlenecks, contention, or management
issues.

Cluster engineering I have a reasonable understanding of; storage
cluster engineering at this scale is way beyond my ken, although I'm
learning fast.

If only I had a couple of hundred thousand dollars, now, I'd build and
buy a bunch of prototypes and really learn it the right way...;-)

 Thanks enormously for the response,

    rgb

> 
> regards, mark hahn.
> 

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu