Sunday, April 29, 2007

Writing a Spelling Corrector in PLT Scheme

[Remember to leave a comment: Was this post silly? Enlightning? Old news?]

Introduction

Peter Norvig recently wrote a great piece on How to Write a Spelling Corrector. Since Norvig used Python, Shiro decided to write a version in Gauche Scheme. In the following I'll present a solution in PLT Scheme. But first let's look at string manipulations.

String Manipulations

Scheme offers the usual operations on strings: string concatenation (string-append), referencing a character in a string (string-ref), extracting a substring (substring) and converting characters to strings (string). Compared to other languages code for string manipulations tend to become verbose, unless one "cheats" and uses some sort of utility such as format. [The advantage of this verboseness is that Scheme compilers can generate efficient code.]

As an example consider this expression from Norvig's spelling corrector:
  [word[0:i]+c+word[i:]
Here word[0:i] is the substring from index 0 (inclusive) to index i (exclusive).
The c is a character, and word[i:] is the substring from index i to the end.
The concatenation operator + converts automatically the character c to a string before the strings are concatenated.

A literal translation to Scheme reads as follows:
 (string-append (substring word 0 i) (string c) (substring word i (string-length word)))

Ouch!

A utility is clearly needed. Below follows my shot at one such utility, namely, concat. It allows us to write the expression as:
(concat word 0 i  c  word i _)

A string followed by two numbers is a substring, so word 0 i is short for (substring word 0 i). An underscore is allowed instead of a number either to indicate the beginning or the end. Thus word i _ is short for (substring word i (string-length word)). Finally a character is automatically converted to a string, so c is short for (string c). Not used in the example, but useful in general: A string followed by a single number, like word i, is short for (string-ref word i).

The code for concat is found at the bottom of this post.

The Spelling Corrector

There are two parts of the spelling corrector. The first part reads in correctly spelled words from a corpus, and counts the number of occurences of each word. The second part uses the training data to check whether a given word is known, or if it is incorrectly spelled to find the intended word.

Reading words from a file

We can split a string in words with the help of regexp-split.
 > (regexp-split #rx"[^a-zA-Z]" "foo bar baz")
("foo" "bar" "baz")

Taking advantage of the fact that PLT Scheme's regular expression functions work both on strings as well as directly on input ports, we can call regexp-split directly on the port.
; words : port -> list-of-strings
(define (words text)
(list-ec (: word (regexp-split #rx"[^a-zA-Z]" text))
(string-downcase (bytes->string/latin-1 word))))

Counting words

Now we'll count how many times each word is seen. We'll use a hash table to hold words and counts.

; train : list-of-strings -> hash-table-from-strings-to-numbers
(define (train features)
(let ([model (make-hash-table 'equal)])
(do-ec (: word features)
(hash-table-put! model word (add1 (hash-table-get model word 1))))
model))

Let's read in the data from the small corpus, and at the same time define word-count that returns the count of a word, and known? that checks whether a word was seen in the training set.
(define NWORDS (train (words (open-input-file "small.txt"))))

(define (word-count word)
(hash-table-get NWORDS word 0))

(define (known? word)
(hash-table-get NWORDS word #f))

The set of strings with edit distance 1

When a word is incorrectly spelled, we want to suggest a "similar" word known to be correctly spelled.

Given a word w, the set of words with edit distance 1 consists of the words, we can generate from w by either deleting one character, or transposing to character, or altering one character, or inserting one character.

Deleting the character with index i from a word is easy with the help of concat:
   (concat word 0 i     word (+ i 1) _)

If n is the length of w, then the set of strings generated by deleting a single character is given by:
 (set-ec string-compare (: i n) (concat word 0 i     word (+ i 1) _))

Here (: i n) makes i run through the numbers 0, 1, ... n-1. For each i the string (concat word 0 i word (+ i 1) _) is calculated, and all strings are collected in a set.

All words with edit distance 1 is given by:
(define (edits1 word)
(define alphabet (string-ec (: c #\a #\z) c))
(define n (string-length word))
(union*
(set-ec string-compare (: i n) (concat word 0 i word (+ i 1) _))
(set-ec string-compare (: i (- n 1)) (concat word 0 i word (+ i 1) word i word (+ i 2) _))
(set-ec string-compare (: i n) (: c alphabet) (concat word 0 i c word (+ i 1) _))
(set-ec string-compare (: i n) (: c alphabet) (concat word 0 i c word i _))))

(define (union* . sets)
(foldl union (empty string-compare) sets))

The set of strings with edit distance 2

Given a word w, the set of words with edit distance 2 is the set generated by 2 deletions, transponations, alterations, or insertions. Luckily we have done the hard work in edits1.
(define (edits2 word)
(set-ec string-compare
(: e1 (elements (edits1 word)))
(: e2 (elements (edits1 e1)))
e2))

There is just one catch. The set is awfully large, so instead, we'll concentrate on correctly spelled (known) words with a edit distance of two:
; known : set-of-strings -> set-of-strings
; remove the unknown words in words
(define (known words)
(set-ec string-compare
(:set w words) (if (known? w)) w))

(define (known-edits2 word)
(set-ec string-compare
(:set e1 (edits1 word))
(:set e2 (edits1 e1))
(if (known? e2))
e2))

Maximum

If we have more than one suggestion for an incorrectly spelled word, we want to find the most common of them - that is the one with the largest word count in the training set. Python's max has a nice feature, where given a list of xs, it finds the x that maximizes f(x), for a function f. Here is a Scheme version:
(define (maximum xs key)
; return the x with the largest key
(define max-key -inf.0)
(define max-x #f)
(do-ec (: x xs)
(:let k (key x))
(if (> k max-key))
(begin
(set! max-key k)
(set! max-x x)))
max-x)

Correcting a word

To correct a word, we first check whether it is known. We try to find a known word with edit distance 1. If unsuccessful we try to find one with edit distance 2.
(define (correct word)
(define (falsify s) (if (empty? s) #f s))
(let ([candidates
(or (falsify (known (set word)))
(falsify (known (edits1 word)))
(falsify (known-edits2 word))
(falsify (list word)))])
(maximum (elements candidates) word-count)))


Let's try it:
 > (correct "hose")
"house"


Last Remarks

Were it not for the concat macro, the function edits1 would have been clumsy. I am not sure concat is the answer. Thanks to macros, one can at least experiment with language extensions in Scheme.

Remember to read Norvig's piece, where he explains the math behind.
[Hmm. Should I turn it into a math exercise for my class?]


Remember to leave a comment: Was this post silly? Enlightning? Old news?]

A Spelling Corrector in PLT Scheme

(require (lib "42.ss" "srfi")
(lib "67.ss" "srfi")
(planet "set.scm" ("soegaard" "galore.plt" 2 (= 2))))

; words : port -> list-of-strings
(define (words text)
(list-ec (: word (regexp-split #rx"[^a-zA-Z]" text))
(string-downcase (bytes->string/latin-1 word))))

; train : list-of-strings -> hash-table-from-strings-to-numbers
(define (train features)
(let ([model (make-hash-table 'equal)])
(do-ec (: word features)
(hash-table-put! model word (add1 (hash-table-get model word 1))))
model))

(define file "small.txt")
;(define file "big.txt")
(define NWORDS (train (words (open-input-file file))))

(define (word-count word)
(hash-table-get NWORDS word 0))

(define (known? word)
(hash-table-get NWORDS word #f))

(define (known words)
(set-ec string-compare
(:set w words) (if (known? w)) w))

(define (concat-it spec)
(define (underscore? o) (eq? o '_))
; spec :: = '() | ( string num num . spec ) | (string num _) | |
(define (subs spec)
(match spec
[() '()]
[((? string? s) (? number? n1) (? number? n2) . spec)
(cons (substring s n1 n2) (subs spec))]
[((? string? s) (? number? n1) (? underscore? _) . spec)
(cons (substring s n1 (string-length s)) (subs spec))]
[((? string? s) (? underscore? _) (? number? n2) . spec)
(cons (substring s 0 n2) (subs spec))]
[((? string? s) (? number? n) . spec)
(cons (string (string-ref s n)) (subs spec))]
[((? symbol? s) . spec)
(cons (symbol->string s) (subs spec))]
[((? char? c) . spec)
(cons (string c) (subs spec))]
[((? string? s) . spec)
(cons s (subs spec))]
[else (error)]))
(apply string-append (subs spec)))

(define-syntax (concat stx)
(syntax-case stx (_)
[(string-it spec ...)
#`(concat-it
(list #,@(map (lambda (so)
(syntax-case so (_)
[_ #''_]
[else so]))
(syntax->list #'(spec ...)))))]))

(define (union* . sets)
(foldl union (empty string-compare) sets))

(define (edits1 word)
(define alphabet (string-ec (: c #\a #\z) c))
(define n (string-length word))
(union*
(set-ec string-compare (: i n) (concat word 0 i word (+ i 1) _))
(set-ec string-compare (: i (- n 1)) (concat word 0 i word (+ i 1) word i word (+ i 2) _))
(set-ec string-compare (: i n) (: c alphabet) (concat word 0 i c word (+ i 1) _))
(set-ec string-compare (: i n) (: c alphabet) (concat word 0 i c word i _))))

(define (edits2 word)
(set-ec string-compare
(: e1 (elements (edits1 word)))
(: e2 (elements (edits1 e1)))
e2))

(define (known-edits2 word)
(set-ec string-compare
(:set e1 (edits1 word))
(:set e2 (edits1 e1))
(if (known? e2))
e2))

(define (maximum xs key)
; return the x with the largest key
(define max-key -inf.0)
(define max-x #f)
(do-ec (: x xs)
(:let k (key x))
(if (> k max-key))
(begin
(set! max-key k)
(set! max-x x)))
max-x)

(define (correct word)
(define (falsify s) (if (empty? s) #f s))
(let ([candidates
(or (falsify (known (set word)))
(falsify (known (edits1 word)))
(falsify (known-edits2 word))
(falsify (list->set (list word))))])
(maximum (elements candidates) word-count)))

Wednesday, April 25, 2007

An Introduction to Web Development with PLT Scheme - An Intermezzo

Intermezzo

This is the third post in a series programming web applications with PLT Scheme. In part one and two the model and view of a small reddit-like application were discussed. Before turning to the control part of this application, we need to look at the basic operation of the PLT Scheme web server.

Requests and responses - the big picture

What happens when a user clicks on a link to your web-site? The browser sends a request to the web-server. The request holds among other information the protocol version and the uri (uniform resource identifier) of the requested resource. The web-server must now compute a response and send it back.

How the response is computed depends on the requested resource. In the case of static content, the response of the web-server consists of a few headers (including the mime-type of the file) and a verbatim copy of the file.

Servlets

Dynamic content is handled by servlets. A servlet is a small program that given a representation of a request computes a response, which the web-server then sends back. The standard "Hello World"-servlet reads as follows:
(module hello-world mzscheme
(provide interface-version timeout start)
(define interface-version 'v1)
(define timeout +inf.0)

; start : request -> response
(define (start initial-request)
`(html (head (title "Hello World"))
(body ([bgcolor "white"])
(p "Coffee is bliss.")))))

A module based servlet must provide a start function, that given a request computes a response. In this case the response is XHTML represented as an X-expression. If your fancy is classical HTML as strings, you can return a list, whose first element is the mime type as a byte string, and whose second element is a list of strings (or byte strings) with the lines of the document. For more advanced purposes (such as generating incremental responses) one can return a response structure - see the documentation for details.

A little tip: If you place a call to report-errors-to-browser at the beginning of start, then the web server will catch any otherwise uncaught exceptions, generate an error page (with source location information) and send it to the client. This is considerably more convenient than constantly rumaging through log files.

Requests

Back to the request structure. A request structure holds:
  • the method (get, post, ...)
  • the uri of the resource
  • the headers (including cookies)
  • bindings (from name value pairs either from a posted form or from the query url itself)
  • the host-ip (in case the same web server instance handles more than one ip)
  • the client ip
  • raw data from a post operations
The net collection contains many valuable utilities to deal with uri, headers, cookies etc. I have collected these and other utilities in a little web framework available from PLaneT. It also contains some syntax that hopefully, makes it a bit easier to write servlets. The philosophy is to let the servlet parse the request and store the information in parameters. Similarly parameters hold what will become assembled to a response, when the servlet is finished. For simple servlets, like the one above, nothing much is gained, but more advanced usage patterns like post-redirect-get become much easier to code. The last part of this series will use the framework to implement the control.

Try the PLT Web Server

The PLT Web Server is included in the DrScheme download. If you start "PLT Web Server.exe" (Windows) or "web-server-text" (unix and mac), then the static content will be served from "c:/Programs/PLT/collects/web-server/default-web-root/htdocs" and the servlets are in will be in "...default-web-root/servlets/". A few examples are included.

Tuesday, April 24, 2007

Fun with macros - extending application to support vector notation

[Remember to leave a comment: Was this post silly? Enligthning? Old news?]

"Scheme vector notation sucks" - a common complaint often heard amongst children and drunks. They might even be right. The sum of a three element vector is calculated like this:
 (define a (vector 1 2 3))
(+ (vector-ref a 0) (vector-ref a 1) (vector-ref a 2))

It looks like they are right. Can we repair this with some macro magic?

Let's first consider what syntax we want to implement. Maybe something like this:
 (+ a[0] a[1] a[2])

Not bad, but in most Scheme implementations the parentheses () and [] mean the same.
How about
  (+ (a _0)  (a _1)  (a _2))

then? That we can implement. In PLT Scheme the above is parsed as the
application:
  (#%app + (#%app a _0) (#%app a _1) (#%app a _2))

All we need to do now, is to implement a new application, that takes special action
when the second argument begins with an underscore. In the following code underscore-symbol? is a function that checks whether a symbol starts with an underscore, and remove-underscore removes an initial underscore from a symbol.

The code also handles the case where the subscript is an identifier, such as in (v _n).
(define-syntax (my-app stx)
(syntax-case stx ()
[(_ expr sub)
(and (identifier? #'sub)
(underscore-symbol? (syntax-e #'sub)))
(let* ([sub (remove-underscore (syntax-e #'sub))]
[num (string->number (symbol->string sub))])
(cond
[(and (number? num) (or (zero? num) (positive? num)))
(quasisyntax/loc stx
(vector-ref expr #,(datum->syntax-object #'sub num)))]
[else
(quasisyntax/loc stx
(vector-ref expr #,(datum->syntax-object stx sub)))]))]
[(_ more ...)
#'(#%app more ...)]
[_
(raise-syntax-error 'my-app "bad syntax, in ~a" stx)]))


Notice the use of quasisyntax/loc instead of plain quasisyntax.
If a user writes (v _4) and v is of size 3, then an error must be signaled.
The culprit is the vector-ref expression, so if plain quasisyntax is used,
DrScheme will color the vector-ref expression red - and the user will think
the macro code is faulty. Using quasisyntax/loc allows us to report errors
to the usage of the macro.
  (quasisyntax/loc stx
(vector-ref expr #,(datum->syntax-object #'sub num)))

We also need a notation for the case, where the subscript is a general expression.
That's fortunately easy. Add _ to the list of literals of the syntax-case expression, and add this clause:
      [(__ expr _ sub)
(syntax/loc stx
(vector-ref expr sub))]

Finally we also need a convenient way to make assignments to vectors. The syntax
(v 2 := 42) is easy to implement. And just in case people forget, we also support (v _2 := 42).
The entire source follows.

Remember to leave a comment: Was this post silly? Enligthning? Old news?

; vector-hack.scm   --  Jens Axel Søgaard
(module vector-hack mzscheme
(provide (rename my-app #%app ))

(begin-for-syntax
(define (underscore-symbol? sym)
(and (symbol? sym)
(let ([str (symbol->string sym)])
(> (string-length str) 0)
(equal? (string-ref str 0) #\_))))

(define (remove-underscore sym)
(if (underscore-symbol? sym)
(let ([str (symbol->string sym)])
(string->symbol (substring str 1 (string-length str))))
(error 'remove-underscore "symbol doesn't start with an underscore: ~a" sym))))


(define-syntax (my-app stx)
(syntax-case stx (_ :=)
[(__ expr sub)
(and (identifier? #'sub)
(underscore-symbol? (syntax-e #'sub)))
(let* ([sub (remove-underscore (syntax-e #'sub))]
[num (string->number (symbol->string sub))])
(cond
[(and (number? num) (or (zero? num) (positive? num)))
(quasisyntax/loc stx
(vector-ref expr #,(datum->syntax-object #'sub num)))]
[else
(quasisyntax/loc stx
(vector-ref expr #,(datum->syntax-object stx sub)))]))]
[(__ expr _ sub)
(syntax/loc stx
(vector-ref expr sub))]
[(__ expr sub := value-expr)
(and (identifier? #'sub)
(underscore-symbol? (syntax-e #'sub)))
(let* ([sub (remove-underscore (syntax-e #'sub))]
[num (string->number (symbol->string sub))])
(cond
[(and (number? num) (or (zero? num) (positive? num)))
(quasisyntax/loc stx
(vector-set! expr #,(datum->syntax-object #'sub num) value-expr))]
[else
(quasisyntax/loc stx
(vector-set! expr #,(datum->syntax-object stx sub) value-expr))]))]
[(__ expr sub := value-expr)
(syntax/loc stx
(vector-set! expr sub value-expr))]
[(__ more ...)
#'(#%app more ...)]
[__
(raise-syntax-error 'my-app "bad syntax, in ~a" stx)])))


(require vector-hack)
(define v (vector 0 1 2))
(v _0) ; => 0
(v _1) ; => 1
(v _2) ; => 2
; (my-app v _3) ; => error, correct syntax location

(define a 1)
(v _a) ; => 1
(+ (v _0) (v _1) (v _2)) ; => 3

v ; => #(0 1 2)
(v _0 := 3) ; #<void>
v ; => #(3 1 2)

(v 0 := 42) ; void
v ; => #(42 1 2)