Sunday, April 16, 2006

Index Construction

Update: The last part were rewritten.

Sort-based index construction

The implementation of the indexer have progressed better than I expected. As a warmup to the implementation of the "sort-based multiway compressed" algorithm I have implemented both the "sort-based" and the "sort-based compressed" algorithm. They both work in four phases:
  1. For all terms in all documents a record (t, d, f) of the term number, the document number d and the frequency f of the term t in the document d is made in a temporary file.

  2. The temporary file is divided into blocks, which are sorted in-memory.

  3. A series of runs are made. Each run merges two sorted blocks. The end result is a large file consisting of (t,d,f) triples sorted in lexicographic order.

  4. Based on the (t,d,f) triples the index is constructed. For each term number t an inverted list of the form (t n d1 f1 di f2 ... dn fn) is written to the index file. Here n is the number of documents in which the term occurs; di and fi are associated document numbers and frequencies.
During the tokenizing in phase 1 a lexicon is made. The lexicon associates terms and term numbers. Furthermore during phase 4 the location of the term's inverted list in the index file is saved.

Searching for a term is simple: Use the lexicon to look up the location of the inverted list in the index file. Read the inverted list. Present the result - the frequencies are used to rank the documents after relevancy.

To keep the size of the index and the temporary file down the numbers aren't stored as standard 32 bit integers. Since an integer N requires only ceil(log(N)) bits, 32 bit integers waste a lot of bits for small numbers. Instead more flexible encodings are used in which small numbers are represented in fewer bits than larger numbers [More about bit encodings in a later post].

Sort-based index construction with compression

A standard compression trick is to use gaps to store ascending lists. Say a term is found in documents (2 6 9 13) then the corresponding gaps are (2 4 3 4). The original document numbers are found by adding the gaps, e.g. 9 = 2+4+3. The important thing to note is that the gaps are smaller than the original numbers and thus can be stored in fewer bits (if an approriate encoding is chosen).

This observation can be used to store the inverted lists, since the inverted list of a term (t n d1 f1 di f2 ... dn fn) has ascending document numbers di. That is, changing the representation of an inverted list to (t n dg1 f1 dgi f2 ... dgn fn), where dg stands for document gap, will reduce the size of the final index. The alert reader will note that the same trick applies to the term numbers in the final index. The extra alert reader will note that all terms are present in the final index and each term occurs only once, so we can omit the term number t completely from the final index.

Compressing the final index file will have relatively little effect on the total run time of the indexer. The bulk of the work is namely done in phase 3 during the external sorting of the temporary files consisting of the (t, d, f) triples. Now compressing these temporary files would reduce i/o considerably and hence also the run time. Can we use the same trick again? We sure can - the temporary files consists of blocks of triples and within each block the term numbers t occur in ascending order. The only complication of changing the representation from (t, d, f) to (tg,d,f), where tg is the term number gap, is some additional bookkeeping during merging.

The difference between the "sort-based" and the "sort-based compression" algorithm is thus that the latter compresses the temporary files.

One last observation: By fusing phase 1 and 2 a pass through the temporary file is saved.



Blogger Unknown said...

your understanding of unary coding is wrong. Please check with a book before you blog

Blogger Jens Axel Søgaard said...

Hi Raj,



Post a Comment

<< Home