Scheme Search gets pattern matching
This week I had time to work on the search engine again.
From a user's point of view, the most important change is the addition of pattern matching. Until now it was possible to find all documents in which a particular identifier occurs. If, on the other hand, you were unsure which identifier to search for, you were out of luck. Now you can search for occurrences of identifiers matching a regular expression.
Say you vaguely remember someone using a define-something-struct. A pattern match search for define alone gives 6993 hits, but a search for define-.*-struct gives only 22 hits (the first of which contains define-serializable-struct).
The implementation of the pattern matching search is kept very simple. There are only 150,000 search terms, so the naïve approach of simply matching all terms against the pattern one at a time is fast enough. At least for the moment... After the matching terms have been found, they are looked up in the index, and finally the list of documents is ranked.
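The naïve approach can be sketched in a few lines. This is only an illustration in Python, not the search engine's actual code: the toy lexicon and index, the function names, the whole-identifier matching and the ranking rule (documents containing the most matching terms first) are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical toy data: the lexicon lists every indexed identifier and
# the index maps each identifier to the documents it occurs in.
LEXICON = ["define", "define-struct", "define-serializable-struct",
           "define-syntax", "lambda"]
INDEX = {
    "define": ["a.scm", "b.scm", "c.scm"],
    "define-struct": ["a.scm"],
    "define-serializable-struct": ["b.scm", "c.scm"],
    "define-syntax": ["c.scm"],
    "lambda": ["a.scm", "b.scm"],
}


def matching_terms(pattern, lexicon):
    """Naive scan: test every term in the lexicon against the pattern."""
    regex = re.compile(pattern)
    # Assumption: the pattern has to match the whole identifier.
    return [term for term in lexicon if regex.fullmatch(term)]


def pattern_search(pattern, lexicon, index):
    """Return documents, ranked by how many matching terms they contain."""
    hits = Counter()
    for term in matching_terms(pattern, lexicon):
        for doc in index.get(term, []):
            hits[doc] += 1
    # Assumed ranking: most matching terms first, ties broken by name.
    return sorted(hits, key=lambda doc: (-hits[doc], doc))
```

With 150,000 terms the scan is one compiled regular expression applied 150,000 times per query, which is why it is fast enough for now. A smarter scheme would only be needed if the lexicon grew much larger.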
Under the hood, the representation of the lexicon changed from an in-memory hash table to a disk-based representation. This has two advantages: the web server uses less memory, and the lexicon uses less space on disk (although the old lexicon resided in memory, it was stored on disk and read in when the web server started). The disadvantage is that searches now require disk access. To keep disk access to a minimum, the lexicon is read in blocks of 100 terms, and a few recently used blocks are cached. If there is a need to look up several terms, it is now best to look them up in alphabetical order, so that consecutive lookups tend to hit the same cached block.
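To see why alphabetical order helps, here is a small Python sketch of a blocked lexicon with a least-recently-used cache. It is a model of the idea rather than the real implementation: the class name, the in-memory list standing in for the disk file, the cache size and the eviction policy are assumptions. Only the block size of 100 terms comes from the description above.

```python
from bisect import bisect_right
from collections import OrderedDict

BLOCK_SIZE = 100  # terms per block, as described in the post
CACHE_SIZE = 4    # "a few" cached blocks; the exact number is an assumption


class BlockedLexicon:
    """Sorted lexicon split into fixed-size blocks with a small LRU cache."""

    def __init__(self, entries, block_size=BLOCK_SIZE, cache_size=CACHE_SIZE):
        items = sorted(entries.items())
        # Stand-in for the file on disk: a list of blocks of (term, value).
        self._disk = [items[i:i + block_size]
                      for i in range(0, len(items), block_size)]
        # The first term of each block stays in memory to locate blocks.
        self._first_terms = [block[0][0] for block in self._disk]
        self._cache = OrderedDict()
        self._cache_size = cache_size
        self.disk_reads = 0  # counts simulated block reads

    def _load_block(self, number):
        if number in self._cache:
            self._cache.move_to_end(number)  # mark as most recently used
            return self._cache[number]
        self.disk_reads += 1                 # simulated disk access
        block = dict(self._disk[number])
        self._cache[number] = block
        if len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)  # evict least recently used
        return block

    def lookup(self, term):
        """Return the value stored for term, or None if it is absent."""
        number = bisect_right(self._first_terms, term) - 1
        if number < 0:
            return None
        return self._load_block(number).get(term)
```

Looking up a batch with `for term in sorted(terms): lexicon.lookup(term)` touches each block at most once, because all terms from the same block come right after each other. In arbitrary order the same batch can keep evicting and re-reading blocks when the cache is small.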
The plan is to use the disk space saved for more indexes. Perhaps an index over the documentation à la the HelpDesk? Unfortunately, I haven't got enough disk space to implement "preview" of search results.
In other news, Google has released Google Code Search. It indexes source from SourceForge, Google Code and other public repositories.
To search for Scheme source add
file:(.scm|.ss|.sch)
to the beginning of your query. Try the above and search for "srfi" to see how many Scheme projects they have found.
Labels: search engine