Saturday, May 7, 2011

Build your own internet search engine - Part 2

After starting to build my own internet search engine, as described in a previous blog post, I have now read some papers and books about web search engine architecture and information retrieval to complete my hobby project. Here is a list of papers and books that I highly recommend to anybody who is interested in this topic:

[1] Google: data structures and algorithms by Petteri Huuhka
[2] The Anatomy of a Large-Scale Hypertextual Web Search Engine by the Google founders Sergey Brin and Lawrence Page
[3] Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze
[4] Effect of inverted index partitioning schemes on performance of query processing in parallel text retrieval systems by B. Barla Cambazoglu, Aytul Catal and Cevdet Aykanat
[5] Distributed Web Crawling, Indexing, and Search by Ricardo Baeza-Yates and B. Barla Cambazoglu
[6] Web Search for a Planet by Luiz André Barroso, Jeffrey Dean and Urs Hölzle
[7] Building a Search Engine by David Evans and Sebastian Thrun

As described in my previous blog post, I built the whole search engine using Erlang technologies. This worked out extremely well for the search bots. But using CouchDB to store all the web documents, the forward search index and the inverted search index was a bad idea.
A NoSQL database like CouchDB is great for building a web store like Amazon, but it is definitely not well suited for building a highly scalable web search engine.
The problem is that CouchDB is simply not specialized enough for this task. For every search query you have to look at millions of documents and rank them appropriately. But if you use something like CouchDB (which has a JSON interface), you need far too many resources of every kind (CPU time, memory and network bandwidth) while merging and ranking the documents for multiple search keywords. Now I know that :-).

So I have to remove CouchDB from my search engine software stack and implement the required data structures and algorithms by myself, just as explained in [1] and [2].
One extremely important thing is to have compact data structures for storing the web documents, the lexicon, the forward index and the inverted index, because a lot of these data structures have to be kept in memory for efficiency reasons. Compact structures also make merging the document lists of multiple search keywords by DocID much easier and faster.
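To make "compact" concrete, here is a minimal Erlang sketch of one such structure: a posting list kept as a flat binary of fixed-width DocIDs instead of a list of integers or JSON documents. The 32-bit DocID width and the module name are assumptions for illustration, not part of a finished design.

```erlang
%% Sketch of a compact posting list: a sorted run of 32-bit DocIDs packed
%% into a single flat binary (the 32-bit width is an assumption).
-module(posting_list).
-export([from_docids/1, to_docids/1]).

%% Pack a sorted list of DocIDs into a flat binary, 4 bytes per DocID.
from_docids(DocIds) ->
    << <<DocId:32>> || DocId <- DocIds >>.

%% Unpack the binary back into the sorted list of DocIDs.
to_docids(Bin) ->
    [DocId || <<DocId:32>> <= Bin].
```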
My data structures will be similar to those in [1] and [2]. There will be an inverted search index sorted by WordID (keyword). For every keyword, the inverted index contains a list of matching documents sorted by DocID. The lexicon contains an entry for every searchable word and points to the corresponding list of documents in the inverted index. The lexicon and inverted index are generated from the web documents using a MapReduce framework.
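The following is a small single-node sketch of how such a lexicon/inverted index could be built in Erlang from a list of {DocId, Words} pairs. In the real system this work would be done by MapReduce jobs over the crawled documents; the module, function and variable names here are illustrative assumptions.

```erlang
%% Single-node sketch: build a lexicon that maps each word to its
%% DocID-sorted posting list. Not the real MapReduce pipeline.
-module(index_build).
-export([build/1, postings/2]).

%% build([{DocId, Words}]) -> dict mapping Word -> sorted list of DocIDs.
build(Docs) ->
    Raw = lists:foldl(
            fun({DocId, Words}, Acc) ->
                lists:foldl(fun(Word, A) -> dict:append(Word, DocId, A) end,
                            Acc, lists:usort(Words))
            end, dict:new(), Docs),
    %% Sort every posting list by DocID so lists can later be merged quickly.
    dict:map(fun(_Word, DocIds) -> lists:usort(DocIds) end, Raw).

%% Look a word up in the lexicon and return its posting list.
postings(Word, Lexicon) ->
    case dict:find(Word, Lexicon) of
        {ok, DocIds} -> DocIds;
        error        -> []
    end.
```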
If the inverted index is queried for the keywords "earth" and "energy", the lexicon is first asked for the two lists of documents containing these words. Then these two lists are merged using mergesort. The merge phase generates a new temporary search result index that is sorted by the rank of the contained documents. For example, when a document (DocID) is included in both lists, it gets a higher rank in the temporary search result index for that search query, so it may appear further ahead in the list of search results than documents that only match a single keyword. Besides the number of matching keywords, other information such as the proximity of the keywords is also used to calculate the final ranking.
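Here is a sketch of that merge step in Erlang for two DocID-sorted posting lists. Documents appearing in both lists get a higher score than single-keyword matches; the concrete scores (1 and 2) are placeholders and ignore PageRank, keyword proximity and the other ranking signals mentioned above.

```erlang
%% Merge two DocID-sorted posting lists for a two-keyword query and score
%% documents by how many of the keywords they match (placeholder scoring).
-module(query_merge).
-export([merge/2]).

%% merge(PostingsA, PostingsB) -> [{Score, DocId}] sorted by descending score.
merge(A, B) ->
    lists:reverse(lists:keysort(1, merge(A, B, []))).

merge([], [], Acc) -> Acc;
merge([D | As], [], Acc) -> merge(As, [], [{1, D} | Acc]);
merge([], [D | Bs], Acc) -> merge([], Bs, [{1, D} | Acc]);
merge([D | As], [D | Bs], Acc) ->            %% DocID matches both keywords
    merge(As, Bs, [{2, D} | Acc]);
merge([Da | As], [Db | _] = B, Acc) when Da < Db ->
    merge(As, B, [{1, Da} | Acc]);
merge(A, [Db | Bs], Acc) ->
    merge(A, Bs, [{1, Db} | Acc]).
```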
The temporary index for the search keywords is then used to generate the search results page and therefore does not need to contain more than about 1000 documents. It may also make sense to cache the temporary index of a search query for a few minutes.
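A possible sketch of such a short-lived cache in Erlang, using an ETS table and an expiry timestamp; the module name and the five-minute TTL are assumptions.

```erlang
%% Cache the merged top-N result list of a query for a few minutes.
-module(result_cache).
-export([init/0, store/2, lookup/1]).

-define(TTL_SECONDS, 300).  %% assumed time-to-live of a cached result list

init() ->
    ets:new(result_cache, [named_table, set, public]).

%% Store the temporary result index under a key derived from the query.
store(QueryKey, Results) ->
    Expires = erlang:system_time(second) + ?TTL_SECONDS,
    ets:insert(result_cache, {QueryKey, Expires, Results}).

%% Return {hit, Results} while the entry is still fresh, otherwise miss.
lookup(QueryKey) ->
    case ets:lookup(result_cache, QueryKey) of
        [{_, Expires, Results}] ->
            case erlang:system_time(second) =< Expires of
                true  -> {hit, Results};
                false -> miss
            end;
        [] -> miss
    end.
```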

Now I have put together all the data structures and algorithms needed to build a working web search engine. However, to build a highly scalable and fast search engine, I have to distribute the lexicon and the inverted search index across multiple computers, so that each computer holds only a part of the lexicon and the inverted search index. To achieve this, one can either do a term-based partitioning or a document-based partitioning of the inverted index, as described in [3], [4] and [5]. I will use the document-based partitioning approach. The overall search result quality must not suffer from partitioning the inverted search index, so the partitioning algorithm is a little tricky.
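As an illustration of document-based partitioning, the sketch below assigns every DocID to one of N index shards by hashing. This simple hash assignment is my own placeholder; [3], [4] and [5] describe more sophisticated partitioning schemes that take result quality and load balance into account.

```erlang
%% Document-based partitioning sketch: each document lands on exactly one
%% index shard, which then indexes and queries only its own documents.
-module(doc_partition).
-export([shard_for/2]).

%% Map a DocID to one of NumShards index shards (0..NumShards-1).
shard_for(DocId, NumShards) ->
    erlang:phash2(DocId, NumShards).
```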
With a distributed inverted search index, a single search query is performed on multiple computers simultaneously. For example, when the complete inverted index is distributed across thousands of computers, one search query may be executed on hundreds of them. Each computer can then perform the search query on its local part of the inverted index (as explained above) very quickly. Afterwards, the temporary search result index of each worker computer (e.g. its 1000 top-ranked local documents) is sent to a master computer, which only requires minimal network bandwidth. The master computer then merges the temporary search result indexes of all worker computers participating in that search query and generates the overall list of best-matching documents. This architecture makes the search engine very fast and fault-tolerant.
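A scatter/gather sketch of that query flow in Erlang: the master spawns one request per shard node, every node replies with its local top-ranked {Score, DocId} list, and the master merges them into the final ranking. The shard_worker:search_local/1 function and the timeout value are assumptions standing in for the real per-shard query code.

```erlang
%% Scatter a query to all shard nodes, gather their local top lists, and
%% merge them into the overall ranking. shard_worker:search_local/1 is an
%% assumed placeholder for the per-shard query described above.
-module(query_master).
-export([search/2]).

search(Keywords, ShardNodes) ->
    Master = self(),
    Ref = make_ref(),
    %% Scatter: every shard node searches its local part of the index.
    _Pids = [spawn(Node, fun() ->
                 Master ! {Ref, Node, shard_worker:search_local(Keywords)}
             end) || Node <- ShardNodes],
    %% Gather: collect one partial result per node, with a timeout so a
    %% slow or dead node cannot stall the whole query.
    Partials = [receive
                    {Ref, _Node, LocalTop} -> LocalTop
                after 500 -> []
                end || _ <- ShardNodes],
    %% Merge the partial {Score, DocId} lists and keep the best 1000 overall.
    Merged = lists:reverse(lists:keysort(1, lists:append(Partials))),
    lists:sublist(Merged, 1000).
```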

What is really interesting is that processing a single search query becomes highly concurrent within the search engine backend, which keeps response times low and utilizes the available hardware resources efficiently.
I found a website that quotes Marissa Mayer saying that a single search query on Google is performed by up to 1000 computers. This website also contains Marissa Mayer's Google I/O 2008 keynote "How Google Works", which gives some interesting insights into the Google search engine.

One interesting question that comes to my mind is whether one could save energy by using Erlang technologies instead of Python and C++ for the search engine backend. Of course Erlang will not help to save energy on the I/O-bound tasks of the search engine backend. But maybe, by using Erlang technologies, one could achieve the degree of distribution and concurrency needed to run the internet search engine backend with somewhat fewer computers and therefore less energy. I really don't know if that is possible, but it would be nice to try it out...