Thursday, March 30, 2006

The Crux of a Search Engine

The job of a search engine is given a term to find all documents within a document collection which contain the term. To answer queries in a reasonable amount of time even in the presence of a vast document collection, an index of all search terms is prepared in advance. The index is in its simplest form nothing more than a sorted list of search terms; where each term has the list of the documents, wherein it is found, associated.

If the total size of the document collection is so small that the entire index fits into main memory it is straightforward to generate the index. The real challenge is to handle large document collections on small machines. Using a harddisk to store the index trades memory for time. Compressing the stored index will keep the space usage down. If the CPU is fast compared to disk throughput it is quite possible that compressing the index will achieve a time saving too.

The book Managing Gigabytes by Ian H. Witten, Alistair Moffat and Timothy C. Bell has the following table over predicted resource requirements to invert (i.e. build an index) a document collection with a total size of 5 Gbytes. The algorithms are more-or-less listed in order of increasing sophistication. The somewhat arbitrary limit of 40 Mbytes of main memory stems from the first edition of the book from 1993.

Method Memory (Mbytes) Disk (Mbytes) Time (hours)
Linked lists (memory) 4,000 0 6
Linked lists (disk) 30 4,000 1,100
Sort-based 40 8,000 20
Sort-based compressed 40 680 26
Sort-based multiway compressed 40 540 11
In-memory compressed 420 1 12
Lexcion-based, no extra disk 0 0 79
Lexcion-based, extra disk 40 4,000 12
Text-based partition 40 35 15

The first two entries illustrate the the space/time tradeoff really well.

The "Linked lists (memory)" algorithm builds an index in memory in the naïve way: all terms are stored as keys in a hash table, their associated value is the list of documents containing the term. It uses the almost as much memory as the entire collection.

The "Linked lists (disk)" uses the same approach, but stores the linked lists on a disk. Memory usage is fine, but due to the speed difference between main memory and disk the time usage explodes to the unreasonable.

The "Sort-based" approach builds the index in three tempi. First a temporary file consisting of (term-number, document-number) tuples are generated by looping over all terms in all documents. Second, the temporary file is sorted by an external sort; two tuples are compared by looking at the terms, the document number is used to break ties. Finally the sorted file is used to generate the index. This approach brings the execution time down in a reasonable range again. Due to tempoary file the space usage is almost twice the size of the total collection.

The "Sort-based compressed" algorithm refines the previous technique and reduces the disk space needed by using compression techniques.

"Sort-based multiway compressed" improves upon the sorting methods with much improved time usage as consequence.

For this project I'll implement the "Sort-based compressed" method.

Wednesday, March 29, 2006

Everything Scheme

This blog will be about all things Scheme.

The main theme will in the beginning be musings about building a search engine with Scheme. Having a concrete project to blog about will hopefully improve both the quality and the frequency of posts. I am easily amused, so expect diversions along the way.

But - why on earth write a search engine?

Recently I hacked together the PLT Source Browser which enables you to browse the source code for all packages submitted to the PLaneT Package Repository as well as the source for the various builtin collections (libraries) of PLT Scheme. The source browser is a great tool to study other peoples code - that is - if you know where to look. Looking for usages of a particular functions is at the moment difficult. Adding a Google SiteSearch helped a little, but only a little. It is a problem that the Google index is never completely up-to-date, that Google interprets "-" as a delimeter which leads to false hits, and finally that there is no way to help Google rank the various hits.

Whether the need for a custom search engine is perceived or not, the inner workings of a search engine is a fun playground for both algorithms for compression as well as for datastructures. What more can one wish for in a project?