[Beowulf] search engine

Robert G. Brown rgb at phy.duke.edu
Tue Jan 4 10:14:42 PST 2005


On Tue, 4 Jan 2005, Noel Tanmoy Das wrote:

> how can i build a search engine (e.g. something like google) in a
> beowulf cluster? help wanted.

Wrong cluster type.  A search engine normally runs on what is called a
"high availability" type cluster, although it certainly shares a lot of
features with beowulf or HPC clusters.

There are several answers possible here.  One is to contact google and
buy/rent their engine.  It is a very, very good one and for a
professional enterprise project that requires an internal/private search
engine well worth the cost.  A second one (if all you want to do is let
people search for stuff you have up on a big website) is to use google
for free -- it is fairly trivial to add a google box to any web page.

If you want to WRITE an open-source search engine to e.g. COMPETE with
google -- well, a google search on something like "search engine open
source" turns up a list of free and open source tools at
http://www.searchtools.com/tools/tools-opensource.html.  I'd look
over these projects, pick the best one with the most active group
working on it, and join that project rather than starting your own from
scratch.  It is very likely that one or more of the projects listed on
this page already run on a cluster of some sort, as building and
searching a very, very large database is a task with lots of natural
parallelism.  
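
To make the "natural parallelism" concrete, here is a minimal sketch in
Python.  It is a toy under stated assumptions, not how google or any of
the projects on that page actually work: each "node" holds its own slice
of the documents and builds its own inverted index (word -> document
ids), and a query is sent to every node at once and the partial results
are merged.  The names build_shard_index, search_shard, search_cluster
and the tiny SHARDS corpus are all hypothetical, and threads stand in
for what would be separate machines on a real cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_shard_index(docs):
    """Build an inverted index (word -> set of doc ids) for one shard."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_shard(index, query):
    """Return the ids of docs in this shard that contain every query word."""
    words = query.lower().split()
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits


def search_cluster(shard_indexes, query):
    """Scatter the query to all shards in parallel, then merge the hits."""
    with ThreadPoolExecutor(max_workers=len(shard_indexes)) as pool:
        partials = list(pool.map(lambda idx: search_shard(idx, query),
                                 shard_indexes))
    return sorted(set().union(*partials))


# Hypothetical toy corpus, split across two "nodes".
SHARDS = [
    {1: "beowulf cluster search", 2: "high availability cluster"},
    {3: "open source search engine", 4: "cluster search engine"},
]
# Each shard indexes only its own documents; no shard needs the others.
INDEXES = [build_shard_index(docs) for docs in SHARDS]
```

Calling search_cluster(INDEXES, "cluster search") asks both shards for
documents containing both words and merges what comes back.  Both the
indexing step and the query step touch each shard independently, which
is why this kind of job spreads so easily over many nodes; ranking the
results and keeping the service up when nodes fail are the hard parts
this sketch leaves out.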

It is also very nontrivial -- I couldn't begin to tell you exactly how
it all works, as I don't know.  To me google is just plain black magic --
it seems to cross-reference EVERYTHING on the web all the way down to
fairly deep embedded text (at a guess, well over a petabyte of
distributed data) and still returns hits on most searches in a matter of
seconds, no matter what the search string and no matter when you use it.

It's like a tiny piece of the mind of God... or if you prefer a less
blasphemous metaphor derived from "The Lucifer Principle", it is the
memory function of the extended neural network that forms the
superorganism known as "The Web", where we, and the websites we
contribute and maintain, are the neurons themselves.  If the human race
has a developing collective intelligence, this is a core piece of it.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu




