Writing a Spelling Corrector in PLT Scheme
[Remember to leave a comment: Was this post silly? Enlightning? Old news?]
Introduction
Peter Norvig recently wrote a great piece on How to Write a Spelling Corrector. Since Norvig used Python, Shiro decided to write a version in Gauche Scheme. In the following I'll present a solution in PLT Scheme. But first let's look at string manipulations.
String Manipulations
Scheme offers the usual operations on strings: string concatenation (string-append), referencing a character in a string (string-ref), extracting a substring (substring) and converting characters to strings (string). Compared to other languages code for string manipulations tend to become verbose, unless one "cheats" and uses some sort of utility such as format. [The advantage of this verboseness is that Scheme compilers can generate efficient code.]
As an example consider this expression from Norvig's spelling corrector:
The c is a character, and word[i:] is the substring from index i to the end.
The concatenation operator + converts automatically the character c to a string before the strings are concatenated.
A literal translation to Scheme reads as follows:
Ouch!
A utility is clearly needed. Below follows my shot at one such utility, namely, concat. It allows us to write the expression as:
A string followed by two numbers is a substring, so word 0 i is short for (substring word 0 i). An underscore is allowed instead of a number either to indicate the beginning or the end. Thus word i _ is short for (substring word i (string-length word)). Finally a character is automatically converted to a string, so c is short for (string c). Not used in the example, but useful in general: A string followed by a single number, like word i, is short for (string-ref word i).
The code for concat is found at the bottom of this post.
The Spelling Corrector
There are two parts of the spelling corrector. The first part reads in correctly spelled words from a corpus, and counts the number of occurences of each word. The second part uses the training data to check whether a given word is known, or if it is incorrectly spelled to find the intended word.
Reading words from a file
We can split a string in words with the help of regexp-split.
Taking advantage of the fact that PLT Scheme's regular expression functions work both on strings as well as directly on input ports, we can call regexp-split directly on the port.
Counting words
Now we'll count how many times each word is seen. We'll use a hash table to hold words and counts.
Let's read in the data from the small corpus, and at the same time define word-count that returns the count of a word, and known? that checks whether a word was seen in the training set.
The set of strings with edit distance 1
When a word is incorrectly spelled, we want to suggest a "similar" word known to be correctly spelled.
Given a word w, the set of words with edit distance 1 consists of the words, we can generate from w by either deleting one character, or transposing to character, or altering one character, or inserting one character.
Deleting the character with index i from a word is easy with the help of concat:
If n is the length of w, then the set of strings generated by deleting a single character is given by:
Here (: i n) makes i run through the numbers 0, 1, ... n-1. For each i the string (concat word 0 i word (+ i 1) _) is calculated, and all strings are collected in a set.
All words with edit distance 1 is given by:
The set of strings with edit distance 2
Given a word w, the set of words with edit distance 2 is the set generated by 2 deletions, transponations, alterations, or insertions. Luckily we have done the hard work in edits1.
There is just one catch. The set is awfully large, so instead, we'll concentrate on correctly spelled (known) words with a edit distance of two:
Maximum
If we have more than one suggestion for an incorrectly spelled word, we want to find the most common of them - that is the one with the largest word count in the training set. Python's max has a nice feature, where given a list of xs, it finds the x that maximizes f(x), for a function f. Here is a Scheme version:
Correcting a word
To correct a word, we first check whether it is known. We try to find a known word with edit distance 1. If unsuccessful we try to find one with edit distance 2.
Let's try it:
Last Remarks
Were it not for the concat macro, the function edits1 would have been clumsy. I am not sure concat is the answer. Thanks to macros, one can at least experiment with language extensions in Scheme.
Remember to read Norvig's piece, where he explains the math behind.
[Hmm. Should I turn it into a math exercise for my class?]
Remember to leave a comment: Was this post silly? Enlightning? Old news?]
A Spelling Corrector in PLT Scheme
Introduction
Peter Norvig recently wrote a great piece on How to Write a Spelling Corrector. Since Norvig used Python, Shiro decided to write a version in Gauche Scheme. In the following I'll present a solution in PLT Scheme. But first let's look at string manipulations.
String Manipulations
Scheme offers the usual operations on strings: string concatenation (string-append), referencing a character in a string (string-ref), extracting a substring (substring) and converting characters to strings (string). Compared to other languages code for string manipulations tend to become verbose, unless one "cheats" and uses some sort of utility such as format. [The advantage of this verboseness is that Scheme compilers can generate efficient code.]
As an example consider this expression from Norvig's spelling corrector:
[word[0:i]+c+word[i:]Here word[0:i] is the substring from index 0 (inclusive) to index i (exclusive).
The c is a character, and word[i:] is the substring from index i to the end.
The concatenation operator + converts automatically the character c to a string before the strings are concatenated.
A literal translation to Scheme reads as follows:
(string-append (substring word 0 i) (string c) (substring word i (string-length word)))
Ouch!
A utility is clearly needed. Below follows my shot at one such utility, namely, concat. It allows us to write the expression as:
(concat word 0 i c word i _)
A string followed by two numbers is a substring, so word 0 i is short for (substring word 0 i). An underscore is allowed instead of a number either to indicate the beginning or the end. Thus word i _ is short for (substring word i (string-length word)). Finally a character is automatically converted to a string, so c is short for (string c). Not used in the example, but useful in general: A string followed by a single number, like word i, is short for (string-ref word i).
The code for concat is found at the bottom of this post.
The Spelling Corrector
There are two parts of the spelling corrector. The first part reads in correctly spelled words from a corpus, and counts the number of occurences of each word. The second part uses the training data to check whether a given word is known, or if it is incorrectly spelled to find the intended word.
Reading words from a file
We can split a string in words with the help of regexp-split.
> (regexp-split #rx"[^a-zA-Z]" "foo bar baz")
("foo" "bar" "baz")
Taking advantage of the fact that PLT Scheme's regular expression functions work both on strings as well as directly on input ports, we can call regexp-split directly on the port.
; words : port -> list-of-strings
(define (words text)
(list-ec (: word (regexp-split #rx"[^a-zA-Z]" text))
(string-downcase (bytes->string/latin-1 word))))
Counting words
Now we'll count how many times each word is seen. We'll use a hash table to hold words and counts.
; train : list-of-strings -> hash-table-from-strings-to-numbers
(define (train features)
(let ([model (make-hash-table 'equal)])
(do-ec (: word features)
(hash-table-put! model word (add1 (hash-table-get model word 1))))
model))
Let's read in the data from the small corpus, and at the same time define word-count that returns the count of a word, and known? that checks whether a word was seen in the training set.
(define NWORDS (train (words (open-input-file "small.txt"))))
(define (word-count word)
(hash-table-get NWORDS word 0))
(define (known? word)
(hash-table-get NWORDS word #f))
The set of strings with edit distance 1
When a word is incorrectly spelled, we want to suggest a "similar" word known to be correctly spelled.
Given a word w, the set of words with edit distance 1 consists of the words, we can generate from w by either deleting one character, or transposing to character, or altering one character, or inserting one character.
Deleting the character with index i from a word is easy with the help of concat:
(concat word 0 i word (+ i 1) _)
If n is the length of w, then the set of strings generated by deleting a single character is given by:
(set-ec string-compare (: i n) (concat word 0 i word (+ i 1) _))
Here (: i n) makes i run through the numbers 0, 1, ... n-1. For each i the string (concat word 0 i word (+ i 1) _) is calculated, and all strings are collected in a set.
All words with edit distance 1 is given by:
(define (edits1 word)
(define alphabet (string-ec (: c #\a #\z) c))
(define n (string-length word))
(union*
(set-ec string-compare (: i n) (concat word 0 i word (+ i 1) _))
(set-ec string-compare (: i (- n 1)) (concat word 0 i word (+ i 1) word i word (+ i 2) _))
(set-ec string-compare (: i n) (: c alphabet) (concat word 0 i c word (+ i 1) _))
(set-ec string-compare (: i n) (: c alphabet) (concat word 0 i c word i _))))
(define (union* . sets)
(foldl union (empty string-compare) sets))
The set of strings with edit distance 2
Given a word w, the set of words with edit distance 2 is the set generated by 2 deletions, transponations, alterations, or insertions. Luckily we have done the hard work in edits1.
(define (edits2 word)
(set-ec string-compare
(: e1 (elements (edits1 word)))
(: e2 (elements (edits1 e1)))
e2))
There is just one catch. The set is awfully large, so instead, we'll concentrate on correctly spelled (known) words with a edit distance of two:
; known : set-of-strings -> set-of-strings
; remove the unknown words in words
(define (known words)
(set-ec string-compare
(:set w words) (if (known? w)) w))
(define (known-edits2 word)
(set-ec string-compare
(:set e1 (edits1 word))
(:set e2 (edits1 e1))
(if (known? e2))
e2))
Maximum
If we have more than one suggestion for an incorrectly spelled word, we want to find the most common of them - that is the one with the largest word count in the training set. Python's max has a nice feature, where given a list of xs, it finds the x that maximizes f(x), for a function f. Here is a Scheme version:
(define (maximum xs key)
; return the x with the largest key
(define max-key -inf.0)
(define max-x #f)
(do-ec (: x xs)
(:let k (key x))
(if (> k max-key))
(begin
(set! max-key k)
(set! max-x x)))
max-x)
Correcting a word
To correct a word, we first check whether it is known. We try to find a known word with edit distance 1. If unsuccessful we try to find one with edit distance 2.
(define (correct word)
(define (falsify s) (if (empty? s) #f s))
(let ([candidates
(or (falsify (known (set word)))
(falsify (known (edits1 word)))
(falsify (known-edits2 word))
(falsify (list word)))])
(maximum (elements candidates) word-count)))
Let's try it:
> (correct "hose")
"house"
Last Remarks
Were it not for the concat macro, the function edits1 would have been clumsy. I am not sure concat is the answer. Thanks to macros, one can at least experiment with language extensions in Scheme.
Remember to read Norvig's piece, where he explains the math behind.
[Hmm. Should I turn it into a math exercise for my class?]
Remember to leave a comment: Was this post silly? Enlightning? Old news?]
A Spelling Corrector in PLT Scheme
(require (lib "42.ss" "srfi")
(lib "67.ss" "srfi")
(planet "set.scm" ("soegaard" "galore.plt" 2 (= 2))))
; words : port -> list-of-strings
(define (words text)
(list-ec (: word (regexp-split #rx"[^a-zA-Z]" text))
(string-downcase (bytes->string/latin-1 word))))
; train : list-of-strings -> hash-table-from-strings-to-numbers
(define (train features)
(let ([model (make-hash-table 'equal)])
(do-ec (: word features)
(hash-table-put! model word (add1 (hash-table-get model word 1))))
model))
(define file "small.txt")
;(define file "big.txt")
(define NWORDS (train (words (open-input-file file))))
(define (word-count word)
(hash-table-get NWORDS word 0))
(define (known? word)
(hash-table-get NWORDS word #f))
(define (known words)
(set-ec string-compare
(:set w words) (if (known? w)) w))
(define (concat-it spec)
(define (underscore? o) (eq? o '_))
; spec :: = '() | ( string num num . spec ) | (string num _) | |
(define (subs spec)
(match spec
[() '()]
[((? string? s) (? number? n1) (? number? n2) . spec)
(cons (substring s n1 n2) (subs spec))]
[((? string? s) (? number? n1) (? underscore? _) . spec)
(cons (substring s n1 (string-length s)) (subs spec))]
[((? string? s) (? underscore? _) (? number? n2) . spec)
(cons (substring s 0 n2) (subs spec))]
[((? string? s) (? number? n) . spec)
(cons (string (string-ref s n)) (subs spec))]
[((? symbol? s) . spec)
(cons (symbol->string s) (subs spec))]
[((? char? c) . spec)
(cons (string c) (subs spec))]
[((? string? s) . spec)
(cons s (subs spec))]
[else (error)]))
(apply string-append (subs spec)))
(define-syntax (concat stx)
(syntax-case stx (_)
[(string-it spec ...)
#`(concat-it
(list #,@(map (lambda (so)
(syntax-case so (_)
[_ #''_]
[else so]))
(syntax->list #'(spec ...)))))]))
(define (union* . sets)
(foldl union (empty string-compare) sets))
(define (edits1 word)
(define alphabet (string-ec (: c #\a #\z) c))
(define n (string-length word))
(union*
(set-ec string-compare (: i n) (concat word 0 i word (+ i 1) _))
(set-ec string-compare (: i (- n 1)) (concat word 0 i word (+ i 1) word i word (+ i 2) _))
(set-ec string-compare (: i n) (: c alphabet) (concat word 0 i c word (+ i 1) _))
(set-ec string-compare (: i n) (: c alphabet) (concat word 0 i c word i _))))
(define (edits2 word)
(set-ec string-compare
(: e1 (elements (edits1 word)))
(: e2 (elements (edits1 e1)))
e2))
(define (known-edits2 word)
(set-ec string-compare
(:set e1 (edits1 word))
(:set e2 (edits1 e1))
(if (known? e2))
e2))
(define (maximum xs key)
; return the x with the largest key
(define max-key -inf.0)
(define max-x #f)
(do-ec (: x xs)
(:let k (key x))
(if (> k max-key))
(begin
(set! max-key k)
(set! max-x x)))
max-x)
(define (correct word)
(define (falsify s) (if (empty? s) #f s))
(let ([candidates
(or (falsify (known (set word)))
(falsify (known (edits1 word)))
(falsify (known-edits2 word))
(falsify (list->set (list word))))])
(maximum (elements candidates) word-count)))