Tuesday, April 04, 2006

Back of the Envelope

Managing Gigabytes identify the following key parameters of a search engine:

B Total text size
N Number of documents
n Number of distinct words
F Total number of words
f Number of index pointers
I Final size of the compressed inverted file
L Size of dynamic lexicon structure

These parameters together with estimations of disk seek time, disk transfer time per byte etc. are used to make estimations on the ressource usage of the search engine.

My initial plan is to index all source files of the PLT Source Browser (temporary url). In round numbers there are N=5200 files whose total size is 40M. Tokenizing the files reveal that the total number of words are F=3.7 million and that the number of distinct words are n=120.000.

Assuming one word entry will take up 10 bytes 10F=37MBy is needed for the entire index. For so small indexes the naïve approach (linked lists in memory) to indexing would be fine. I'll stick to my decision of using the sort-based compressed approach though. It will be able to run on smaller machines with larger data sets - and the algorithms are more interesting.


Post a Comment

<< Home