[Beowulf] Station wagon full of tapes
Robert G. Brown
rgb at phy.duke.edu
Tue May 26 09:26:19 PDT 2009
On Tue, 26 May 2009, Chris Dagdigian wrote:
> The flip side to your arguments is that I may not want my tax dollars spent
> on allowing the NIH to operate peta-scale data repositories. I can't be more
> specific than this -- my most recent exposure to a large government life
> science directorate revealed that they were spending $500K/year on EMC
> maintenance costs for a few tens-of-TBs worth of disk arrays that were going
> on 6 years old!
Yeah, well, stupidity is a universal problem, even in the
government...;-) But this is why CBAs and smart people (working
together) are so important.
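To put a rough number on that, here is a back-of-envelope sketch of what such a maintenance bill works out to per terabyte. The $500K/year figure is Chris's; the capacities are only my guesses at what "a few tens-of-TBs" might mean, and the function name is made up for illustration.

```python
def cost_per_tb_year(annual_cost_usd, capacity_tb):
    """Return annual cost divided by capacity, in dollars per TB per year."""
    if capacity_tb <= 0:
        raise ValueError("capacity_tb must be positive")
    return annual_cost_usd / capacity_tb


ANNUAL_MAINTENANCE_USD = 500_000      # figure quoted in the post above
ASSUMED_CAPACITIES_TB = (20, 30, 50)  # ASSUMPTION: stand-ins for "a few tens-of-TBs"

if __name__ == "__main__":
    for tb in ASSUMED_CAPACITIES_TB:
        rate = cost_per_tb_year(ANNUAL_MAINTENANCE_USD, tb)
        print(f"{tb:3d} TB -> ${rate:,.0f} per TB per year")
```

Whatever the real capacity was, this is the sort of per-TB figure a CBA would set against the price of new raw disk or a utility provider's rate.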
> I think my main interest in utility storage providers is that they can offer
> geographical redundancy and large capacity at efficiencies that can't be
> matched locally by individual institutions or even local groups of
> institutions. When I look at the full costs of hosting, operating and
> replicating the data in a local facility the numbers from the "utility"
> providers start to look more attractive.
> It will be interesting to see how this all shakes out. The rate at which raw
> disk cost is shrinking in price is amazing and may choke off the profit for
> the utility providers who have invested heavily in building out.
I agree. Moore's Law applies to more than just processors. If only it
applied to bandwidth....;-)
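The bandwidth side of this (and the subject line) is easy to sketch. The following is a minimal estimate, not a benchmark: the 1 TB/day volume and GigE come from the thread, while the other link speeds, the 10 TB shipment, the 24 hour courier time, and the function names are all assumptions chosen for illustration.

```python
def transfer_hours(data_tb, link_mbps, efficiency=1.0):
    """Hours to move data_tb terabytes (1 TB = 1e12 bytes) over a link.

    link_mbps is the nominal rate in megabits/s; efficiency (0..1] is an
    assumed fraction of that rate actually achieved.
    """
    if link_mbps <= 0 or not (0 < efficiency <= 1):
        raise ValueError("need link_mbps > 0 and 0 < efficiency <= 1")
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / 3600.0


def effective_mbps(data_tb, hours):
    """Average megabits/s achieved by delivering data_tb within `hours`,
    e.g. a box of drives handed to a courier."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return data_tb * 1e12 * 8 / (hours * 3600.0) / 1e6


if __name__ == "__main__":
    # ASSUMPTION: nominal link rates, no protocol overhead or sharing.
    for name, mbps in (("100 Mb/s", 100), ("GigE", 1000), ("10 Gb/s", 10000)):
        print(f"1 TB over {name}: {transfer_hours(1, mbps):.1f} hours")
    # ASSUMPTION: 10 TB of drives delivered in 24 hours.
    print(f"10 TB shipped in 24 h: {effective_mbps(10, 24):.0f} Mb/s effective")
```

With an oversubscribed link the efficiency factor drops well below 1, which is when the box of drives starts to win.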
> My $.02 of course!
> On May 26, 2009, at 11:16 AM, Robert G. Brown wrote:
>> On Tue, 26 May 2009, Chris Dagdigian wrote:
>>> I deal quite often with the "next-gen" DNA sequencing instruments that
>>> produce 1TB/day in TIFF images that are then distilled down to the DNA
>>> basecalls before the short reads are subjected to alignment. Then the
>>> resulting longer sequences are usually aligned again against a reference.
>>> Lots of data, lots of computation.
>>> The 1 Terabyte of TIFF images typically reduces down to about 200 GB in
>>> intermediate data which is further distilled down into a few hundred KB of
>>> actual sequence data. The entire process is interesting and it is a
>>> massive Bio/IT challenge as these types of terabyte-scale data producing
>>> lab instruments are popping up everywhere (the cost of one of these
>>> instruments is now easily within reach of a single grant-funded researcher
>>> at a facility of any size...). We are only a few technology revolutions
>>> away from these boxes showing up in your point of care primary physician's
>>> office (well not really, probably a backend service lab that your
>>> physician outsources to ...)
>>> Anyway the new data ingestion service that Amazon offers is, I think,
>>> going to be a big deal in our field.
>> Sure, but why wouldn't it be cheaper for e.g. NSF or NIH to fund an
>> exact clone of the service Amazon plans to offer and provide it for free
>> to its supported research groups (or rather, do bookkeeping but it is
>> all internal bookkeeping, moving money from one pocket to another)?
>> Amazon has to make a profit. Granting agencies don't have to pay the
>> profit that Amazon has to make. Amazon has to take substantial risks to
>> make its profit. Granting agencies have no risk.
>> All of the things you assert for DNA sequencing are true for high energy
>> physics. Enormous datasets, lots of computation. HEP's INTERNATIONAL
>> solution is ATLAS, not Amazon.
>> Supporting commercial access into such a DB a la >>google<< but for
>> genomic data, sure, but that's not really cluster computing, that's a
>> large shared DB. I could see that as a spin off data service of Amazon
>> or Google or a new business altogether, but I'd view it as a niche and
>> not really HPC.
>> Grant funded research involving large scale shared data resources can
>> ALWAYS be done more cheaply than by buying the data services from
>> profit-making third parties unless there are nonlinear e.g. proprietary
>> IP barriers. This is trebly true given that research facilities are
>> typically on very high speed networks, e.g. lambda rail, that the
>> government is funding anyway, whereas Amazon or other commercial third
>> parties have to rent time on those networks and then resell the rental
>> back to the government at a profit, or use slower commercial networks
>> with the same sort of throughput markup.
>> Are there any such barriers here? I'd have to say that I would be most
>> unhappy seeing my own tax dollars going to make Amazon shareholders rich
>> when they could be spent more efficiently without a middleman raking in
>> a 50 to 100% markup on the service. Of course I'm easily irked -- when
>> I think of all the money spent on Windows by the US government it makes
>> my blood boil.
>> I'd want to see a solid CBA proving that this is the cheapest way to
>> proceed before dumping tons of tax money into it, if I were king of the
>> world (or just in charge of a major granting agency).
>>> For the following reasons:
>>> - Bio people are being buried in data
>>> - Once we process the data to get the derived results, the primary data
>>> just needs to go somewhere cheap
>>> - Amazon and other internet-scale people can do peta-scale or exa-scale
>>> storage far better & cheaper than any of my customers
>>> - These instruments are popping up in wet labs across campus with
>>> weak/anemic network links to IT core facilities and data centers
>>> - Scientists in many cases are required to share data that is grant funded
>>> - Amazon has some neat "downloader pays" models that make it easier for
>>> researchers to affordably offer up peta-scale data sets for sharing
>>> I suspect that a very large amount of scientific data will be making a 1-way
>>> trip into the cloud. The data will stay there "forever" as a deep store.
>>> In the occasional cases where the data needs to be re-processed or
>>> re-analyzed it would not be unreasonable to fire up some cloud server
>>> nodes to do the re-work in-situ.
>>> The disk ingest service was the final piece. I can see this happening in
>>> life science environments:
>>> - Massive data generated in the wet lab
>>> - Captured to local storage (10 - 40TB) with small HPC component
>>> - Data is processed locally into derived and distilled forms
>>> - Derived data replicated to campus/lab facilities for online primary
>>> - Derived data (and possibly the full raw data) is compressed, placed onto
>>> drives and ingested into Amazon for long term storage
>>> - If re-analysis is ever needed, have existing EC2 AMIs preloaded with the
>>> necessary software
>>> Basically it comes down to the fact that Amazon may be able to offer
>>> big-yet-slow storage in the terabyte to petabyte range at levels of cost
>>> and geographical redundancy that would be extremely difficult to match
>>> with local resources at a small non-specialized organization.
>>> My $.02 of course
>>> On May 26, 2009, at 8:58 AM, Jeff Layton wrote:
>>>> Gerry Creager wrote:
>>>>> There was an interesting brainstorming session at Rocks-A-Palooza a
>>>>> couple of weeks ago. Someone wants to offer Amazon resources. Problem
>>>>> remains for me: How can I get sufficient cloud resources for computing
>>>>> (I'll hammer on dataset transport in a moment) that will handle
>>>>> reasonable weather models with their small message MPI chatter, and lots
>>>>> of file I/O? I've been assured that Amazon's ready to accommodate that.
>>>> This is one of the problems - clouds aren't ready for this kind of
>>>> usage model yet. They only have GigE and usually it's oversubscribed.
>>>> When you say file IO, they hear capacity, not performance (either
>>>> throughput or IOPS). And as you point out, the pipe to/from the
>>>> cloud is not ready for lots of data.
>>>>> However, getting data into S3 for availability, when a daily
>>>>> multi-gigabyte dataset is used for initiation, and another is created as
>>>>> output, is going to be expensive, and likely slow. I think there are
>>>>> other approaches that have to be evaluated. I am not sure the cloud is
>>>>> ready for MPI play on a significant basis, just yet.
>>>> I haven't seen the cloud ready yet for anything other than embarrassingly
>>>> parallel codes (i.e. single node, small IO requirements). Has anyone seen
>>>> differently? (as an example of what might work, CloudBurst seems to be
>>>> gaining some traction - doing sequencing in the cloud. The only problem
>>>> is that sequencing can generate a great deal of data pretty rapidly).
>>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>>>> To change your subscription (digest mode or unsubscribe) visit
>> Robert G. Brown http://www.phy.duke.edu/~rgb/
>> Duke University Dept. of Physics, Box 90305
>> Durham, N.C. 27708-0305
>> Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu