Wednesday, April 05, 2006

Tokenizing Scheme Source

To index all the words in a file of English text, the first thing you need to do is to find the words. Since I want to index Scheme source, my words are identifiers.

The default reader of MzScheme extends the syntax of R5RS identifiers, so it is necessary to read the fine print before one starts coding. Simplifying a little, one divides the characters into two classes: delimiters and non-delimiters. To find the next token, one simply keeps reading until a non-delimiter is seen. The token is then read character by character until a delimiter is seen. This simplistic approach would work were it not for identifiers beginning with hash-percent (#%); however, it doesn't take much ingenuity to handle those too.
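
Before switching to regular expressions, here is a minimal sketch of that character-by-character approach (it ignores the #% wrinkle); delimiter? and naive-next-token are hypothetical names used only for this illustration, not part of the code further down:

;; A hypothetical helper: is c one of the delimiter characters?
(define (delimiter? c)
  (and (char? c)
       (memv c '(#\( #\) #\[ #\] #\{ #\} #\, #\' #\` #\; #\| #\"
                 #\space #\newline #\return #\tab))))

;; Skip delimiters, then collect characters until the next delimiter.
;; Returns the token as a string, or #f at end of file.
(define (naive-next-token)
  (let loop ()
    (let ([c (peek-char)])
      (cond [(eof-object? c) #f]
            [(delimiter? c) (read-char) (loop)]
            [else
             (list->string
              (let collect ()
                (let ([c (peek-char)])
                  (if (or (eof-object? c) (delimiter? c))
                      '()
                      (cons (read-char) (collect))))))]))))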

The easiest and most efficient way to implement the above approach in PLT Scheme is to use the built-in regular expressions. Other options would be the more flexible Perl-style regular expressions (pregexp) or even the parser-tools collection, which can generate a custom lexer. The built-in regular expressions have the advantage that they work directly on the input port, which makes them very fast. Since PLT Scheme uses the UTF-8 encoding for source code, it is necessary to use byte regexps (as opposed to string-based ones). Another advantage of matching directly on the input port is that the port abstraction keeps track of location information, such as line and column numbers, automatically.
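
As a quick illustration (not part of the indexer below), matching a byte regexp directly against a port consumes exactly the bytes it matched, leaving the port positioned right after the match:

;; Matching against a port consumes the matched bytes.
(define p (open-input-string "   foo bar"))
(regexp-match (byte-regexp (string->bytes/utf-8 "[ ]*")) p)  ; => (#"   ")
(read-char p)                                                ; => #\f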

;; mzlib's match is used below to destructure the result of regexp-match.
(require (lib "match.ss"))

;; Turn a string into a byte regexp (source code is read as UTF-8 bytes).
(define (utf-regexp str)
  (byte-regexp (string->bytes/utf-8 str)))

;; The character classes that make up the delimiters.
(define whitespace  "[ \n\r\t]")
(define brackets    "[]\\[\\(\\){}]")
(define punctuation "[,'`;]")
(define special     "[\\|]")

;; A delimiter is any of the above or a double quote.
(define delimeter  (string-append "(" whitespace "|" punctuation "|"
                                  brackets "|" special "|" "\"" ")"))
(define delimeter* (string-append delimeter "*"))

(define non-symbol-starter-regexp (utf-regexp delimeter*))

;; Skip delimiters until the next token starts.  A single # that does
;; not begin a #% identifier is skipped as well.
(define (skip-to-next-token)
  (regexp-match non-symbol-starter-regexp (current-input-port))
  (when (eqv? (peek-char) #\#)
    (unless (equal? (peek-string 2 0) "#%")
      (read-char)
      (skip-to-next-token))))

;; A token consists of non-delimiters only.
(define non-delimeter  "[^]\\[\\(\\){} \n\r\t,'`;\\|\"]")
(define non-delimeter* (string-append non-delimeter "*"))
(define non-delimeter*-regexp (utf-regexp non-delimeter*))

;; Read the token starting at the current position.  Returns a list of
;; the token (as a byte string) and its position in the port, or #f.
(define (read-token)
  (let* ([pos (file-position (current-input-port))]
         [m   (regexp-match non-delimeter*-regexp (current-input-port))])
    (and m
         (match m
           [(token . _)
            (list token pos)]))))

;; Call f on each token (with its position) read from the current input port.
(define (for-each-token f)
  (unless (eof-object? (peek-char))
    (skip-to-next-token)
    (unless (eof-object? (peek-char))
      (let ([token (read-token)])
        (if token
            (begin
              (f token)
              (for-each-token f))
            (error "internal error: token expected after skipping"))))))

;; Call f on each token in the given file.
(define (for-each-token-in-file file f)
  (with-input-from-file file
    (lambda ()
      (for-each-token f))))
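
To see it in action, a small driver might print every token together with its position; the file name here is of course made up:

;; Print each token and its position from a (made-up) source file.
;; Tokens arrive as byte strings, so convert them back for display.
(for-each-token-in-file "indexer.scm"
  (lambda (token+pos)
    (printf "~a at byte ~a\n"
            (bytes->string/utf-8 (car token+pos))
            (cadr token+pos))))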
