Saturday, May 7, 2011

Build your own internet search engine - Part 2

After having started to build my own internet search engine as described in a previous blog post, I now have read some papers and books about web search engine architecture and information retrieval to complete my hobby project. Here is a list of papers and books that I highly recommend to anybody who is interested in this topic:

1. Google: data structures and algorithms by Petteri Huuhka
2. The Anatomy of a Large-Scale Hypertextual Web Search Engine by the Google founders Sergey Brin and Lawrence Page
3. Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze
4. Effect of inverted index partitioning schemes on performance of query processing in parallel text retrieval systems by B. Barla Cambazoglu, Aytul Catal and Cevdet Aykanat
5. Distributed Web Crawling, Indexing, and Search by Ricardo Baeza-Yates and B. Barla Cambazoglu
6. Web Search for a Planet by Luiz André Barroso, Jeffrey Dean and Urs Hölzle
7. Building a Search Engine by David Evans and Sebastian Thrun

As described in my previous blog post I build the whole search engine using Erlang technologies. This worked out extremely well for the search bots. But using CouchDB for storing all the web documents, the forward search index and the inverted search index was a bad idea.
A NoSQL database like CouchDB is great for building a web store like Amazon but it is definitely not good at building a highly scalable web search engine.
The problem is that CouchDB is simply not specialized enough for this task. E.g. for every search query you have to look at millions of documents and rank them appropriately. But if you use something like CouchDB (which has a JSON interface) you just need too much resources of everything (CPU time, memory and network bandwidth) while merging and ranking the documents for multiple search keywords. Now I know that :-).

So I have to remove CouchDB from my search engine software stack and implement the required data structures and algorithms by myself, just as explained in [1] and [2].
One extremely important thing is to have compact data structures for storing the web documents, the lexicon, the forward index and the inverted index. This is because you have to keep a lot of data structures in memory for efficiency reasons. It also makes merging the documents of multiple search keywords by DocID much easier and faster.
My data structures will be similar to those in [1] and [2]. There will be an inverted search index which is sorted by WordID (keyword). For every keyword the inverted index contains a list of matching documents which are sorted by DocID. The lexicon contains an entry for every searchable word and links to the list of appropriate documents in the inverted index. The lexicon and inverted index are generated from the web documents using a MapReduce framework.
If the inverted index is queried for the keywords "earth" and "energy" the lexicon is first asked for the two lists of documents containing these words. Then these two listes are merged using mergesort. The mergesort phase generates a new temporary search result index that is sorted by the page rank of the contained documents. E.g. when a document (DocID) is included in both lists it gets a higher page rank in the temporary search result index for that search query. So it may appear further ahead in the list of search results than documents that only match a single keyword. Besides the number of matching keywords, also some other informations like proximity of keywords are used for calculating the final ranking.
The temporary index for the search keywords is then used to generate the search results page and therefore does not need to be larger than 1000 documents. It may also make sense to cache the temporary index of a search query for some minutes.

Now I have put together all data structures and algorithms to build a working web search engine. However, to build a highly scalable and fast search engine I have to distribute the lexicon and the inverted search index across multiple computers. Therefore, each computer gets a part of the lexicon and the inverted search index. To achieve this, one may either do a term-based partitioning of the inverted index or a document-based partitioning of the inverted index, as described in [3][4] and [5]. I will use the document-based partitioning approach. The overall search result quality must not suffer from partitioning the inverted search index and thus the partitioning algorithm is a little tricky.
With a distributed inverted search index a single search query is performed on multiple computers simultaneously. E.g. when the complete inverted index is distributed across thousands of computers, one search query may be executed on hundreds of them. Thereby each computer is able to perform the search query on its local part of the inverted index (as explained above) very quickly. The temporary search result index (e.g. containing the top-ranked 1000 local documents) of each worker computer is sent to some master computer afterwards. This only requires minimal network bandwidth. The master computer then merges the temporary search result indexes of all worker computers participating in that search query and generates the overall list of best matching documents. This architecture makes the search engine very fast and fault tolerant.

What is really interesting is that processing a single search query gets highly concurrent within the search engine backend to achieve low response times and utilize the available hardware resources efficiently.
I found a website that quotes Marissa Mayer that a single search query on Google is performed by up to 1000 computers. This website also contains the Google I/O 2008 keynote of Marissa Mayer about "How Google Works" which gives some interesting insights into the Google search engine.

One interesting question that comes to my mind is if one could save energy by using Erlang technologies instead of Python and C++ for the search engine backend. Of course Erlang will not help to save energy for the I/O-bound tasks of the search engine backend. But maybe by using Erlang technologies one could achieve the same degree of distribution and concurrency that is needed to run the internet search engine backend with some less computers and therefore less energy. I really don't know if that is possible, but it would be nice to try that out...

18 comments:

  1. ZEIT Online: Wie viel Strom verbraucht das Netz?
    http://www.zeit.de/digital/internet/2011-05/internet-facebook-green-it?page=1

    Facebook: Unfriend Coal: http://youtu.be/QPty-ZLbJt0

    ReplyDelete
  2. Erlang Inventors Talk Language Future: http://www.infoq.com/interviews/armstrong-virding-erlang-future

    ReplyDelete
  3. Google sieht sich als Umweltschutz-Vorbild: http://www.heise.de/newsticker/meldung/Google-sieht-sich-als-Umweltschutz-Vorbild-1339457.html

    ReplyDelete
  4. To answer the question you ask in your last paragraph: it depends. If there is some way that Erlang is able to reduce computational load by lazily not performing some computation that C++ or Python would, you can write C++/Python code to do the same thing. You might get it for free with Erlang, but there is nothing fundamentally more efficient about Erlang.

    To the point, in the programming language benchmark game, Erlang compares poorly to C++ in time and memory behavior: http://shootout.alioth.debian.org/u32/benchmark.php?test=all&lang=hipe&lang2=gpp

    ReplyDelete
  5. Hm, I'm not that sure. You can achieve much more concurrency and easier distribution with less computing power that with other languages that depend on threads for concurrency. That's totally different for Erlang processes and their scheduling. E.g. GitHub uses Erlang a lot for distribution etc. but uses other languages for the rest of their backend system...

    ReplyDelete
  6. Replies
    1. Hi Daniel, I am working to make search engine that show coding results from different sites over the internet. Can you please help me?

      Delete
  7. As for light-weight processes vs OS threads - nothing stops you from using explicit finit-automata or green-threads in C++ program. TBH most high-concurrency apps do.

    ReplyDelete
  8. @ddoarn: That's right. You can do so.
    But one difference between Erlang processes and user-mode threads in C++ is the scheduling mechanism. Erlangs reduction count scheduling is hightly efficient and therefore the context switch costs are minimal. Also the blocking semantics between plain user-mode threads and Erlang processes are very similar. E.g. what happens to the OS thread if the user mode thread (green thread) that currently runs on the OS thread blocks maybe on an I/O task? Will this block all green threads from running on the OS thread?
    Exactly that does not happen with Erlang processes. That's the key to Erlang's concurrency and why it is superior to a user-mode thread library.
    But if you want to get such features you really have to design your system (and the VM) around that principles. That's where Erlang's native bindings ports come into play.

    ReplyDelete
    Replies
    1. It seems that some Google engineers think the same way that I do and figured that out in the Go programming language :-).
      Rob Pike - Concurrency Is Not Parallelism: http://vimeo.com/groups/waza2012/videos/49718712

      Delete
  9. ah, sorry the blocking semantics of user-mode threads and Erlang processes are different not similar :-).

    ReplyDelete
  10. @blocking semantics - ofc you should use async IO, provided by green-threads library.

    Real difference lies on other plane - preemptive vs cooperative multitasking. But for real-world apps those is neglectable. Main problem is - lack of widely adopted quality solution/library for native green threads, so if you wish to use one - you should develop one yourself.

    Don't get me wrong - Erlang is perfect tool for apparent niche. But trying to outperforming good-written C++ code with Erlang is a bit, grhm, naive. It has other strengths.

    ReplyDelete
  11. Well ddoarn, I really don't think so. Not everything is perfect in Erlang but it provides the right abstractions (and good implementations) for concurrency. There are a lot of tasks C++ is better suited and also faster, agreed. But there are other tasks where something like Erlang is better suited. It do not agree that it is only cooperative vs. preemtive multitasking and async IO stuff. But that's a matter of opinion. Future will tell us more :-).

    ReplyDelete
  12. http://en.wikipedia.org/wiki/Erlang_(programming_language)

    http://stackoverflow.com/questions/1947180/whats-the-difference-between-green-threads-and-erlangs-processes

    http://stackoverflow.com/questions/2708033/technically-why-is-processes-in-erlang-more-efficient-than-os-threads

    http://stackoverflow.com/questions/5878231/what-exactly-makes-erlang-process-green-thread-coroutine-lighter-than-kernel

    http://www.cs.ucsb.edu/~puneet/reports/erlang.pdf

    http://www.amazon.com/Modern-Operating-Systems-Andrew-Tanenbaum/dp/0136006639/ref=sr_1_1?ie=UTF8&qid=1324633048&sr=8-1

    ReplyDelete
  13. http://news.ycombinator.com/item?id=873004

    http://en.wikipedia.org/wiki/Green_threads

    ReplyDelete
  14. Erlang's big advantage is the internal scheduler, and a systematic approach to writing code that will never block the Erlang OS process (see http://news.ycombinator.com/item?id=873004).

    What makes Erlang, some microkernel OSes, maybe some VM-based green thread implementations etc. different is the approach to writing code that will never block active entities.

    ReplyDelete
    Replies
    1. We love erlang not because its lightweight process, but its VM&GC which make it soft real time. Erlang's poor performance processing binary and big memory make it a bad choice for db and search index.
      Couchdb will go back to c world. Riak do both db and search, but the underlining store engine is innostore or leveldb(which are wrote by C++). And the riak search is relatively new.
      Use erlang as top-level framework and use c to do the low-level things. This will get better result.

      Delete