On Fri, Oct 29, 2010 at 03:02:45PM -0400, Ellis H. Wilson III wrote:

> I think it's making a pretty wild assumption to say search engines and  
> HPC have the same I/O needs (and thus can use the same I/O setups).

Well, I'm an HPC guy doing infrastructure for a search engine, so I'm
not assuming much. And I didn't say the setup would be the same --
just that Lustre/PVFS would probably be more reliable and higher
performance if they stored copies on multiple servers instead of using
local or SAN RAID. (Or did they implement this while I wasn't looking?)

> Also, I'd be blown away if Blekko wasn't doing its own
> striping/redundancy - even if they aren't using RAID 0 or 1 by the book,  
> they probably are using the same concepts (though hand-spun for search  
> engine needs).

We do the usual thing: store 3 copies on 3 different servers, locality
picked such that a single network or power failure won't take out more
than 1 copy. Since we are very concerned about transfer rates, the
extra disks are well worth buying: more spindles holding the data means
more aggregate read bandwidth.
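
A minimal sketch of that kind of placement, in Python. It is an
illustration only, not Blekko's actual code: the Server fields, the
place_replicas name, and the switch/power labels are all hypothetical,
and the greedy scan can miss a valid placement that a full search
would find.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    switch: str  # network failure domain (hypothetical label)
    power: str   # power failure domain (hypothetical label)


def place_replicas(servers, copies=3):
    """Greedily pick `copies` servers that share no switch and no power feed.

    With such a placement, losing any single switch or power feed
    removes at most one copy.
    """
    chosen = []
    used_switches, used_power = set(), set()
    for s in servers:
        if s.switch in used_switches or s.power in used_power:
            continue  # would share a failure domain with an earlier copy
        chosen.append(s)
        used_switches.add(s.switch)
        used_power.add(s.power)
        if len(chosen) == copies:
            return chosen
    raise ValueError("not enough independent failure domains")


if __name__ == "__main__":
    cluster = [
        Server("a1", "sw1", "p1"),
        Server("a2", "sw1", "p2"),
        Server("b1", "sw2", "p1"),
        Server("b2", "sw2", "p2"),
        Server("c1", "sw3", "p3"),
    ]
    print([s.name for s in place_replicas(cluster)])
```

In this sample cluster a2 is skipped because it shares a switch with
a1, and b1 because it shares a power feed with a1.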

> I don't think the "whole internet" takes up 5 petabytes,  

The internet is infinite in size thanks to websites that generate data
(or crap). Our 3-billion-page crawl (1/5 of the size we dream of) is
257 terabytes (compressed), and the corresponding index is 77 terabytes
(very compressed). (Yes, we have a lot of disk space empty at the moment.)
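
A back-of-the-envelope check of those figures, in Python. The
assumptions are not from the numbers above and are only for
illustration: decimal terabytes, storage that scales linearly with
page count, and 3 copies of both crawl and index. The function names
are hypothetical.

```python
TB = 10**12  # assumption: decimal terabytes


def bytes_per_page(total_tb, pages):
    """Average stored bytes per crawled page."""
    return total_tb * TB / pages


def scaled_tb(total_tb, scale, copies=1):
    """Storage in TB after growing the crawl by `scale`, kept `copies` times."""
    return total_tb * scale * copies


if __name__ == "__main__":
    pages = 3 * 10**9
    crawl_tb, index_tb = 257, 77

    print(round(bytes_per_page(crawl_tb, pages) / 1000, 1), "KB/page crawl")
    print(round(bytes_per_page(index_tb, pages) / 1000, 1), "KB/page index")

    # Full-size crawl (5x) with 3 copies of everything, in TB.
    print(scaled_tb(crawl_tb + index_tb, scale=5, copies=3), "TB total")
```

Under those assumptions the crawl averages about 86 KB per page and
the index about 26 KB per page, and a 5x crawl kept in triplicate
comes to roughly 5 petabytes.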

-- greg

