[Beowulf] Station wagon full of tapes

Tue May 26 09:26:19 PDT 2009

On Tue, 26 May 2009, Chris Dagdigian wrote:

>
> The flip side to your arguments is that I may not want my tax dollars spent 
> on allowing the NIH to operate peta-scale data repositories. I can't be more 
> specific than this -- my most recent exposure to a large government life 
> science directorate revealed that they were spending $500K/year on EMC 
> maintenance costs for a few tens-of-TBs worth of disk arrays that were going 
> on 6 years old!

Yeah, well, stupidity is a universal problem, even in the
government...;-) But this is why cost-benefit analyses (CBAs) and smart
people (working together) are so important.
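
For scale, here is a back-of-the-envelope sketch in Python of what that
maintenance bill works out to per terabyte. Only the $500K/year figure
comes from the quoted message; the 30 TB capacity is an assumption standing
in for "a few tens-of-TBs", and the function name is made up for
illustration.

```python
# Rough cost-per-terabyte arithmetic for the quoted EMC maintenance example.
# ASSUMPTION: 30 TB stands in for "a few tens-of-TBs"; the real capacity
# was not given in the thread.

def cost_per_tb_year(annual_cost_usd: float, capacity_tb: float) -> float:
    """Return the yearly cost in USD per terabyte of capacity."""
    if capacity_tb <= 0:
        raise ValueError("capacity_tb must be positive")
    return annual_cost_usd / capacity_tb

annual_maintenance_usd = 500_000  # figure from the quoted message
assumed_capacity_tb = 30          # hypothetical, see note above

per_tb = cost_per_tb_year(annual_maintenance_usd, assumed_capacity_tb)
print(f"roughly ${per_tb:,.0f} per TB per year at {assumed_capacity_tb} TB")
```

Plugging in a different capacity guess changes the answer in proportion, so
the number is only as good as the assumed capacity.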

> I think my main interest in utility storage providers is that they can offer 
> geographical redundancy and large capacity at efficiencies that can't be 
> matched locally by individual institutions or even local groups of 
> institutions. When I look at the full costs of hosting, operating and 
> replicating the data in a local facility the numbers from the "utility" 
> providers start to look more attractive.
>
> It will be interesting to see how this all shakes out. The rate at which raw 
> disk cost is shrinking in price is amazing and may choke off the profit for 
> the utility providers who have invested heavily in building out.

I agree.  Moore's Law applies to more than just processors.  If only it
applied to bandwidth....;-)
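
To put a rough number on the bandwidth complaint, here is a quick Python
sketch of how long 1 TB (about one day's output from the sequencers
discussed below) takes to move at various sustained rates, next to the
effective bandwidth of shipping disks. The link rates, the 10 TB shipment
and the 24-hour courier time are illustrative assumptions, not
measurements.

```python
# Time to move a dataset over a network link vs. shipping it on disk.
# ASSUMPTIONS: decimal units (1 TB = 1e12 bytes), the listed link rates are
# sustained end-to-end throughput, and a courier takes 24 hours door to door.

def transfer_hours(size_tb: float, link_gbit_per_s: float) -> float:
    """Hours needed to send size_tb terabytes at link_gbit_per_s Gbit/s."""
    if link_gbit_per_s <= 0:
        raise ValueError("link_gbit_per_s must be positive")
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_gbit_per_s * 1e9)
    return seconds / 3600

def shipped_gbit_per_s(size_tb: float, courier_hours: float) -> float:
    """Effective bandwidth in Gbit/s of shipping size_tb terabytes on disk."""
    if courier_hours <= 0:
        raise ValueError("courier_hours must be positive")
    bits = size_tb * 1e12 * 8
    return bits / (courier_hours * 3600) / 1e9

# Hypothetical link rates for a wet lab, a GigE cloud node, and a fast backbone.
for label, rate in [("100 Mbit/s", 0.1), ("GigE", 1.0), ("10 GigE", 10.0)]:
    print(f"{label:>10}: {transfer_hours(1.0, rate):6.1f} hours per TB")

# A box of disks holding 10 TB, delivered in 24 hours (hypothetical).
print(f"shipping 10 TB in a day: {shipped_gbit_per_s(10.0, 24.0):.2f} Gbit/s")
```

On these assumptions a 100 Mbit/s lab link needs most of a day per
terabyte, while a modest box of disks in a courier van keeps pace with a
fully utilized GigE link.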

   rgb

>
> My $.02 of course!
>
> On May 26, 2009, at 11:16 AM, Robert G. Brown wrote:
>
>> On Tue, 26 May 2009, Chris Dagdigian wrote:
>> 
>>> 
>>> I deal quite often with the "next-gen" DNA sequencing instruments that 
>>> produce 1TB/day in TIFF images that are then distilled down to the DNA 
>>> basecalls before the short reads are subjected to alignment. Then the 
>>> resulting longer sequences are usually aligned again against a reference 
>>> genome.
>>> 
>>> Lots of data, lots of computation.
>>> 
>>> The 1 Terabyte of TIFF images typically reduces down to about 200 GB in 
>>> intermediate data which is further distilled down into a few hundred KB of 
>>> actual sequence data. The entire process is interesting and it is a 
>>> massive Bio/IT challenge as these types of terabyte-scale data producing 
>>> lab instruments are popping up everywhere (the cost of one of these 
>>> instruments is now easily within reach of a single grant-funded researcher 
>>> at a facility of any size...). We are only a few technology revolutions 
>>> away from these boxes showing up in your point of care primary physician's 
>>> office (well not really, probably a backend service lab that your 
>>> physician outsources to ...)
>>> 
>>> Anyway the new data ingestion service that Amazon offers is, I think, 
>>> going to be a big deal in our field.
>> 
>> Sure, but why wouldn't it be cheaper for e.g. NSF or NIH to fund an
>> exact clone of the service Amazon plans to offer and provide it for free
>> to its supported research groups (or rather, do bookkeeping but it is
>> all internal bookkeeping, moving money from one pocket to another)?
>> 
>> Amazon has to make a profit.  Granting agencies don't have to pay the
>> profit that Amazon has to make.  Amazon has to take substantial risks to
>> make its profit.  Granting agencies have no risk.
>> 
>> All of the things you assert for DNA sequencing are true for high energy
>> physics.  Enormous datasets, lots of computation.  HEP's INTERNATIONAL
>> solution is ATLAS, not Amazon.
>> 
>> Supporting commercial access into such a DB a la >>google<< but for
>> genomic data, sure, but that's not really cluster computing, that's a
>> large shared DB.  I could see that as a spin off data service of Amazon
>> or Google or a new business altogether, but I'd view it as a niche and
>> not really HPC.
>> 
>> Grant funded research involving large scale shared data resources can
>> ALWAYS be done more cheaply than by buying the data services from
>> profit-making third parties unless there are nonlinear e.g. proprietary
>> IP barriers.  This is trebly true given that research facilities are
>> typically on very high speed networks, e.g. lambda rail, that the
>> government is funding anyway, where Amazon or other commercial third
>> parties have to rent time on those networks and then resell the rental
>> back to the government at a profit, or use slower commercial networks
>> with the same sort of throughput markup.
>> 
>> Are there any such barriers here?  I'd have to say that I would be most
>> unhappy seeing my own tax dollars going to make Amazon shareholders rich
>> when they could be spent more efficiently without a middleman raking in
>> a 50 to 100% markup on the service.  Of course I'm easily irked -- when
>> I think of all the money spent on Windows by the US government it makes
>> my blood boil.
>> 
>> I'd want to see a solid CBA proving that this is the cheapest way to
>> proceed before dumping tons of tax money into it, if I were king of the
>> world (or just in charge of a major granting agency).
>>
>>  rgb
>> 
>>> 
>>> For the following reasons:
>>> 
>>> - Bio people are being buried in data
>>> - Once we process the data to get the derived results, the primary data 
>>> just needs to go somewhere cheap
>>> - Amazon and other internet-scale people can do peta-scale or exa-scale 
>>> storage far better & cheaper than any of my customers
>>> - These instruments are popping up in wet labs across campus with 
>>> weak/anemic network links to IT core facilities and data centers
>>> - Scientists in many cases are required to share data that is grant funded
>>> - Amazon has some neat "downloader pays" models that make it easier for 
>>> researchers to affordably offer up peta-scale data sets for sharing
>>> 
>>> I suspect that a very large amount of scientific data will be making a 
>>> 1-way trip into the cloud. The data will stay there "forever" as a deep 
>>> store. In the occasional cases where the data needs to be re-processed or 
>>> re-analyzed it would not be unreasonable to fire up some cloud server 
>>> nodes to do the re-work in-situ.
>>> 
>>> The disk ingest service was the final piece. I can see this happening in 
>>> life science environments:
>>> 
>>> - Massive data generated in the wet lab
>>> - Captured to local storage (10 - 40TB) with small HPC component
>>> - Data is processed locally into derived and distilled forms
>>> - Derived data replicated to campus/lab facilities for online primary 
>>> storage
>>> - Derived data (and possibly the full raw data) is compressed, placed onto 
>>> drives and ingested into Amazon for long term storage
>>> - If re-analysis is ever needed, have existing EC2 AMIs preloaded with the 
>>> necessary software
>>> 
>>> Basically it comes down to the fact that Amazon may be able to offer 
>>> big-yet-slow storage in the terabyte to petabyte range at levels of cost 
>>> and geographical redundancy that would be extremely difficult to match 
>>> with local resources at a small non-specialized organization.
>>> 
>>> My $.02 of course
>>> 
>>> -Chris
>>> 
>>> On May 26, 2009, at 8:58 AM, Jeff Layton wrote:
>>> 
>>>> Gerry Creager wrote:
>>>>> There was an interesting brainstorming session at Rocks-A-Palooza a 
>>>>> couple of weeks ago.  Someone wants to offer Amazon resources.  Problem 
>>>>> remains for me: How can I get sufficient cloud resources for computing 
>>>>> (I'll hammer on dataset transport in a moment) that will handle 
>>>>> reasonable weather models with their small message MPI chatter, and lots 
>>>>> of file I/O? I've been assured that Amazon's ready to accommodate that.
>>>> This is one of the problems - clouds aren't ready for this kind of
>>>> usage model yet. They only have GigE and usually it's oversubscribed.
>>>> When you say file IO, they hear capacity, not performance (either
>>>> throughput or IOPS). And as you point out, the pipe to/from the
>>>> cloud is not ready for lots of data.
>>>>> However, getting data into S3 for availability, when a daily 
>>>>> multi-gigabyte dataset is used for initiation, and another is created as 
>>>>> output, is going to be expensive, and likely slow.  I think there are 
>>>>> other approaches that have to be evaluated.  I am not sure the cloud is 
>>>>> ready for MPI play on a significant basis, just yet.
>>>> I haven't seen the cloud ready yet for anything other than embarrassingly
>>>> parallel codes (i.e. single node, small IO requirements). Has anyone seen
>>>> differently? (as an example of what might work, CloudBurst seems to be
>>>> gaining some traction - doing sequencing in the cloud. The only problem
>>>> is that sequencing can generate a great deal of data pretty rapidly).
>>>> Jeff
>>>> _______________________________________________
>>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>>>> To change your subscription (digest mode or unsubscribe) visit 
>>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>> 
>> 
>> Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
>> Duke University Dept. of Physics, Box 90305
>> Durham, N.C. 27708-0305
>> Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
>> 
>
