Sunday, April 3, 2011

Build your own internet search engine

If you are interested in how to really build a web search engine, I suggest reading the second part of this article ("Build your own internet search engine - Part 2") and the section about Apache's search engine software stack at the end of this article.
The CouchDB attempt for the web search engine backend didn't work out, but I nevertheless think this article is quite interesting :-).

A few weeks ago I started building my own (naive) internet search engine using Erlang technologies. I chose Erlang projects for this because I think they are perfectly suited for internet backend systems. Now I am stuck at ranking the search results and will have to read some papers about that before going on :-). Still, up to this point everything has worked out extremely well.
The first part of the puzzle was to build the search bots that bring home the websites for generating the search index. The search bots were built using the Erlang OTP, ibrowse, mochiweb, mochiweb_xpath and couchbeam projects.
The search engine starts by sending out the first search bot to some website, e.g. http://www.nytimes.com. A search bot downloads a website, forks off new search bots for any links that are found on it and then processes the website. After processing a website, each search bot creates (or updates) a CouchDB document that represents the original website along with some keywords, a page rank, etc. This process is repeated over and over again by each search bot (and for each website). You may imagine that the whole thing gets massively parallel in a very short time.
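Roughly, a single search bot boils down to something like the following sketch (simplified and not the actual code; de-duplication, relative-link resolution, politeness and document update conflicts are left out, and the function names are just illustrative):

-module(search_bot).
-export([crawl/1]).

%% Download a page with ibrowse, fork a new bot for every link found via
%% mochiweb_xpath, then store a document for the page with couchbeam.
crawl(Url) ->
    case ibrowse:send_req(Url, [], get) of
        {ok, "200", _Headers, Body} ->
            Tree = mochiweb_html:parse(list_to_binary(Body)),
            %% Assumes the XPath result is a list of binary href values.
            Links = mochiweb_xpath:execute("//a/@href", Tree),
            %% In the real system new bots are started via a supervisor (see below).
            [spawn(?MODULE, crawl, [binary_to_list(Link)]) || Link <- Links],
            store_page(Url, Body);
        _Error ->
            ok  %% unreachable pages are simply skipped in this sketch
    end.

%% Create a CouchDB document for the page; keyword extraction and the
%% page-rank fields are omitted here.
store_page(Url, Body) ->
    Server = couchbeam:server_connection("localhost", 5984),
    {ok, Db} = couchbeam:open_or_create_db(Server, "search_index", []),
    Doc = {[{<<"_id">>, list_to_binary(Url)},
            {<<"url">>, list_to_binary(Url)},
            {<<"keywords">>, extract_keywords(Body)}]},
    couchbeam:save_doc(Db, Doc).

%% Placeholder: a real implementation would tokenize the page text here.
extract_keywords(_Body) -> <<"">>.
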
This massive parallelism caused some headaches for my home router, because it was only able to handle a few thousand concurrent HTTP connections. So I limited the concurrency using an Erlang supervisor process. Maybe I will try out Amazon's EC2 in the future; I'm pretty sure it will perform better at this point :-).
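The throttling itself is conceptually simple; here is a sketch (not the exact code I used) with a simple_one_for_one supervisor whose active child count is checked before a new bot is started:

-module(bot_sup).
-behaviour(supervisor).
-export([start_link/0, start_bot/1, init/1]).

-define(MAX_BOTS, 200).  %% assumed cap; tune it to what the router can take

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Start a bot for Url unless the concurrency cap has been reached.
start_bot(Url) ->
    Counts = supervisor:count_children(?MODULE),
    case proplists:get_value(active, Counts) of
        N when N < ?MAX_BOTS ->
            supervisor:start_child(?MODULE, [Url]);
        _ ->
            {error, too_many_bots}  %% the caller may retry later
    end.

init([]) ->
    %% search_bot:start_link/1 is assumed to wrap the crawl loop from above
    BotSpec = {search_bot, {search_bot, start_link, []},
               temporary, brutal_kill, worker, [search_bot]},
    {ok, {{simple_one_for_one, 0, 1}, [BotSpec]}}.
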
The next part of the puzzle was to build the search index from the CouchDB documents that were brought home by the search bots. This is done using a CouchDB design document. Here is a simplified design document that shows how to generate the search index:

"map": "function(doc) {
    var tokens;
    if (doc.keywords) {
        tokens = doc.keywords.split(/[^A-Z0-9_]+/i);
        tokens.map(function(token) {
            for (i = 1; i <= token.length; i += 1) {
             emit(token.slice(0, i), doc);
            }
        });
    }
}"
"reduce": "function(keys, values, rereduce) {
    var output = {};
    if (!rereduce) {
        for (var i in values) {
            output[i] = values[i].url;
        }
    }
    return output;
}"

Now I was able to query the search index using HTTP requests. For example, the following HTTP POST request queries the search index for the keywords "earth" and "energy". As a result you get links to all documents that match these keywords.

curl -X POST 'http://127.0.0.1:5984/search_index/_design/search_queries/_view/query?group=true' -d '{"keys": ["earth", "energy"]}' -H "Content-Type: application/json"
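
The same query can of course also be issued from Erlang. A rough sketch using ibrowse and mochijson2 (both already part of the stack); query_index/1 is just a made-up helper name:

%% Query the inverted index for a list of keywords (given as binaries),
%% e.g. query_index([<<"earth">>, <<"energy">>]).
query_index(Keywords) ->
    Url = "http://127.0.0.1:5984/search_index/_design/search_queries/"
          "_view/query?group=true",
    Body = iolist_to_binary(mochijson2:encode({struct, [{<<"keys">>, Keywords}]})),
    Headers = [{"Content-Type", "application/json"}],
    {ok, "200", _RespHeaders, Json} = ibrowse:send_req(Url, Headers, post, Body),
    mochijson2:decode(Json).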

At that point I got stuck, because I did not yet know how to appropriately merge and rank the documents retrieved from the CouchDB inverted search index. But this ranking is exactly the crux of a good search engine. My idea is to sort the suggested websites first by the number of matching keywords and second by a page rank derived from Albert-Laszlo Barabasi's book "Linked": the more links refer to a specific website, the higher the page rank of that website.
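
As a sketch of that idea (nothing of this is implemented yet, and the three-tuples are just an assumed intermediate format):

-module(naive_rank).
-export([rank/1]).

%% Results :: [{Url, MatchingKeywords, InboundLinks}]
%% Sort first by the number of matching keywords, then by the number of
%% inbound links (the naive page rank described above), both descending.
rank(Results) ->
    Sorted = lists:sort(
                 fun({_, MatchesA, LinksA}, {_, MatchesB, LinksB}) ->
                         {MatchesA, LinksA} >= {MatchesB, LinksB}
                 end,
                 Results),
    [Url || {Url, _, _} <- Sorted].

For example, rank([{"http://a.example", 2, 17}, {"http://b.example", 2, 3}, {"http://c.example", 1, 99}]) yields the three URLs in exactly that order.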

The web interface for the search engine will simply be a little CouchApp.

One thing I have learned from this hobby project so far is that scalability really means specialization when you build huge systems like search engines. That is exactly what CouchDB does in order to be fast and scalable, and I am sure the same is true for Google's search engine infrastructure.

The story continues in the blog post containing part II on this topic.


Update: The Apache search engine software stack
Before going on with the Erlang approach to building a search engine, I have now looked at the Apache software stack for building search engines. It looks pretty complete and scalable. The information below is taken from Apache's project websites.

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. It includes HDFS, a distributed file system that provides high-throughput access to application data, and a MapReduce software framework for distributed processing of large data sets on compute clusters; the MapReduce framework can run on top of HDFS. Check this link for a Hadoop MapReduce example.
Hadoop also contains the Hive framework, which provides a mechanism to project structure onto data and to query that data using a SQL-like language called HiveQL. It does so by translating HiveQL statements into jobs for Hadoop's MapReduce framework. Hive was developed by Facebook.

Apache Nutch is a highly scalable and relatively feature-rich (web) crawler; it provides the search bots and the surrounding infrastructure. Nutch offers politeness (it obeys robots.txt rules), robustness and scalability (Nutch runs on Apache Hadoop, so you can run it on a single machine or on a cluster of 100 machines), quality (you can bias the crawling to fetch “important” pages first) and extensibility. One of the most important features Nutch provides out of the box is a link database: Nutch tracks the links between pages, so the relevancy of search results within a collection of interlinked documents goes well beyond the naive case where documents are indexed without link information and anchor texts.

Apache Solr is an enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites. Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Tomcat. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language.

When combining Nutch with Solr, Solr is used as the only source for serving search results (including snippets). This way you can completely decouple your search application from Nutch and still use Nutch where it is at its best: crawling and extracting the content. Using Solr as the search backend, on the other hand, lets you use all of the advanced features of a Solr server, like query spell checking, “more like this” suggestions, data replication and easy query-time relevancy tuning.
So Nutch collects the data and Solr serves it via its search index.
See also this blog entry by Sami Siren for more details.

4 comments:

  1. CouchDB @ InfoQ: http://www.infoq.com/presentations/couchdb-Will-Leinweber

  2. Cloudant has added fulltext search to CouchDB: http://support.cloudant.com/kb/search/search-api.
    But the source code of the non-Lucene-based fulltext search has not been released yet :-(.

  3. elasticsearch: http://www.elasticsearch.org/guide/reference/river/couchdb.html.
