Wednesday, April 26, 2006

Compression: Integer Encodings

Quick! How many bits are needed to store a natural number n?

Well, let's look at the standard encoding. One bit is needed for 1 (1). Two bits are needed for 2 (10) and 3 (11), and three bits are needed for 4 (100), ..., 7 (111). Hmm. log(1)=0, log(2)=1, log(4)=2, log(8)=3. So the number of bits needed is 1+floor(log n). Or is it?

Suppose the bits 110 are read from a file. Is this the number 6, or is it a 1 followed by a 2? We can't know unless we know a priori how many bits were used to store the number.

Leaving the standard encoding aside for a moment, consider instead the unary encoding.

n    unary
1    0
2    10
3    110
4    1110


The natural number is represented as n-1 1-bits followed by a single 0-bit.
The unary encoding does not have the same problem as before - it is a prefix-free code. A code for a number is never a prefix for the code of a different number.

The unary encoding is useful in situations where the majority of numbers are very small. For large numbers it is horribly inefficient.

The problem with the standard encoding was that we didn't know where the code ended. Consider now, for the sake of argument, encoding a number n as the unary code for the length of its standard encoding, 1+floor(log n), followed by the standard encoding itself. The total length of that code is 2*(1+floor(log n)). For large numbers this is much better than the unary code.

A small observation saves a bit. Consider the encoding of 4: 110 100. The encoding starts with the unary encoding of 3, the length of the standard encoding. Since all numbers of length 3 have the bit pattern 1xx, it is unnecessary to actually store that leading bit. Thus we can store 4 as just 110 00. This bit-saving encoding is called the gamma encoding.

To sum up, the gamma encoding of a natural number n consists of the unary code for 1+floor(log n) followed by the floor(log n) bits representing n-2^floor(log n) in binary. The number of bits used by the gamma code is 2*floor(log n)+1.
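To make the construction concrete, here is a minimal sketch (not the bit-port based implementation from the indexer) that builds the gamma code of a number as a string of 0s and 1s:

(define (floor-log2 n)                 ; exact floor(log n) for n >= 1
  (let loop ([n n] [k 0])
    (if (< n 2) k (loop (quotient n 2) (+ k 1)))))

(define (gamma-encode n)
  (let* ([k    (floor-log2 n)]
         [bits (number->string n 2)])           ; the standard encoding of n
    (string-append
     (make-string k #\1) "0"                    ; unary code for 1+floor(log n)
     (substring bits 1 (string-length bits))))) ; drop the implicit leading 1

> (gamma-encode 4)
"11000"
> (map gamma-encode '(1 2 3 5))
("0" "100" "101" "11001")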

For numbers smaller than 5 the unary encoding uses no more bits than the gamma code, for 5 they use the same number of bits, and for numbers larger than 5 the gamma code uses fewer bits than the unary code.


A variation of the gamma code is the delta code. The reasoning goes as follows: for numbers whose standard encoding is longer than 5 bits, it would be better to store the length as a gamma code than as a unary code. That is, the delta code for a natural number n consists of the gamma code for the length of the standard encoding of n, followed by the standard encoding.

For numbers below 32 the gamma code is shorter than the delta code. For numbers between 32 and 63 the two codes have the same length. For 64 and larger numbers the delta code is shorter than the gamma code.
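The crossover points are easy to check with a few code-length functions. The following sketch defines delta-length as described above (the gamma code for the length, followed by the full standard encoding); floor-log2 is the helper from the previous sketch.

(define (unary-length n) n)
(define (gamma-length n) (+ (* 2 (floor-log2 n)) 1))
(define (delta-length n)
  (let ([len (+ 1 (floor-log2 n))])   ; length of the standard encoding of n
    (+ (gamma-length len) len)))

> (map gamma-length '(31 32 63 64))
(9 11 11 13)
> (map delta-length '(31 32 63 64))
(10 11 11 12)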

A small excerpt from the implementation of these encodings - mail me if you are interested in a copy. [The Blogger software inserts spurious newlines in the atom feed. See Everything Scheme for the original.]

  (require "../bit-io/bit-io.scm"
           (planet "42.ss" ("soegaard" "srfi.plt")))

;;;
;;; UNARY CODE
;;;

; The unary code for an integer n>=1 is n-1 one bits followed by a zero bit.
; The code for 3 is 110.

(define write-unary
  (case-lambda
    [(n)
     (write-unary n (current-output-bit-port))]
    [(n out-bit-port)
     (unless (and (integer? n) (positive? n))
       (error 'write-unary "a positive integer was expected, got: ~s" n))
     (when (> n 1)
       ; write n-1 one bits, i.e. the number 2^(n-1)-1 in n-1 bits
       (write-bits (sub1 n)
                   (sub1 (arithmetic-shift 2 (sub1 (sub1 n))))
                   out-bit-port))
     ; terminate with a single zero bit
     (write-bits 1 0 out-bit-port)]))

(define read-unary
  (case-lambda
    [()
     (read-unary (current-input-bit-port))]
    [(in-bit-port)
     ; count bits read until the terminating zero bit is seen
     (do ([n 1 (+ n 1)])
         [(= (read-bits 1 in-bit-port) 0)
          n])]))
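A hedged usage sketch (not part of the excerpt above): using the bit-port operations shown in the bit-io post below, write the unary codes for 1 to 4 to a scratch file "tmp" and read them back.

> (with-output-to-bit-file "tmp"
    (lambda ()
      (for-each write-unary '(1 2 3 4)))
    'replace)
> (with-input-from-bit-file "tmp"
    (lambda ()
      (list (read-unary) (read-unary) (read-unary) (read-unary))))
(1 2 3 4)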


Sunday, April 23, 2006

Mini Kanren and the Da Vinci Quest, part II

Today the second symbol puzzle in the Da Vinci Quest was revealed. The puzzle was of the same type as described in Mini Kanren and the Da Vinci Quest - the only twist was that the board was enlarged to 5x5. The modified program for the puzzle I got is:

(define distinct
  (lambda (x1 x2 x3 x4 x5)
    (let ([xs (list x1 x2 x3 x4 x5)])
      (all
       (membero 'omega  xs)
       (membero 'blade  xs)
       (membero 'gcross xs)
       (membero 'fleur  xs)
       (membero 'cross  xs)))))

(run* (b)
  (fresh (x1  x2  x3  x4  x5
          x6  x7  x8  x9  x10
          x11 x12 x13 x14 x15
          x16 x17 x18 x19 x20
          x21 x22 x23 x24 x25)
    (alli
     ; the board consists of 5x5 fields
     (== b (list (list x1  x2  x3  x4  x5)
                 (list x6  x7  x8  x9  x10)
                 (list x11 x12 x13 x14 x15)
                 (list x16 x17 x18 x19 x20)
                 (list x21 x22 x23 x24 x25)))
     ; known symbols
     (== b (list (list x1  'gcross x3     'fleur x5)
                 (list x6  x7      'cross 'blade x10)
                 (list x11 'omega  x13    x14    'cross)
                 (list x16 x17     'fleur x19    x20)
                 (list x21 'cross  x23    x24    x25)))
     ; symbols in rows are distinct
     (distinct x1  x2  x3  x4  x5)
     (distinct x6  x7  x8  x9  x10)
     (distinct x11 x12 x13 x14 x15)
     (distinct x16 x17 x18 x19 x20)
     (distinct x21 x22 x23 x24 x25)
     ; symbols in columns are distinct
     (distinct x1 x6  x11 x16 x21)
     (distinct x2 x7  x12 x17 x22)
     (distinct x3 x8  x13 x18 x23)
     (distinct x4 x9  x14 x19 x24)
     (distinct x5 x10 x15 x20 x25)
     ; symbols in each colored area are distinct
     (distinct x1  x2  x7  x12 x13)
     (distinct x3  x8  x4  x5  x10)
     (distinct x6  x11 x16 x21 x22)
     (distinct x17 x18 x19 x23 x24))))

Friday, April 21, 2006

Reading and writing bits - bit-io.plt

As mentioned in "Index Construction", compression is used to store the index in a space-efficient manner. The number of bits used to encode an integer varies with its size. The standard file operations read and write whole bytes, so the need for a bit-level I/O library arose.

A little shopping around led to Oleg Kiselyov's page on Binary I/O. He presents a very nice solution to the problem of reading bit streams. The function make-bit-reader turns a byte stream into a bit stream. The byte stream is represented as a thunk that produces a new byte each time it is invoked. The returned bit-reader is a function of one argument: the number of bits to read from the stream. The bits read are returned as an unsigned integer.

> (define bit-reader (make-bit-reader (lambda () #b11000101)))
> (bit-reader 3)
6
> (bit-reader 4)
2

Inspired by this approach I wrote a bit-writer in the same style. Given a byte-writer, the function make-bit-writer returns two values: the first is a bit-writer, a function of two arguments, the number of bits to write and the actual bits; the second is a bit-flusher. Since the bit-writer can't call the underlying byte-writer before a whole byte has been received, it was necessary to introduce the bit-flusher to flush any remaining bits at the end of a file. The original reader code was optimized to handle common cases, such as reading a single bit or a single byte, quickly. An effort was made to mimic these optimizations in the writer.
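For illustration, a hedged sketch of how the writer could be used; the exact signatures (a one-argument byte-writer, a zero-argument flusher) are assumptions made for the example, not the library's documented API.

; collect the produced bytes in a list, newest first
(define collected '())
(define-values (bit-writer bit-flusher)
  (make-bit-writer (lambda (byte) (set! collected (cons byte collected)))))

(bit-writer 3 #b110)    ; write the three bits 110
(bit-writer 4 #b0101)   ; write the four bits 0101
(bit-flusher)           ; pad and emit the final partial byte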

Since the first version of the indexer used normal file operations such as with-output-to-file, write-byte and friends, it wasn't straightforward to change the code to use the stream approach. The solution was to introduce bit-ports, which resemble normal ports. In the following examples the numbers from 1 to 8 are written to a temporary file and then read back. Each number is written using a different number of bits.

> (require (planet "bit-io.scm" ("soegaard" "bit-io.plt")))

> (with-output-to-bit-file "tmp"
    (lambda ()
      (do ([i 1 (+ i 1)]) [(= i 9) 'done]
        (write-bits i i)))
    'replace)
done

> (with-input-from-bit-file "tmp"
    (lambda ()
      (do ([i 1 (+ i 1)]
           [ns '() (cons (read-bits i) ns)])
          [(= i 9) ns])))
(8 7 6 5 4 3 2 1)



The bit-io library is available through PLaneT, the documentation and source are available at the PLT Source Browser.


Thursday, April 20, 2006

Mini Kanren and the Da Vinci challenge

The film adaptation of Dan Brown's bestseller "The Da Vinci Code" is about to be released. This prompted the marketing people of Columbia Pictures and Google to create the Google Da Vinci Quest. The quest consists of 24 puzzles to be solved. To avoid cheating, each person can expect to get his own set of puzzles. According to 4-time World Puzzle Championship individual winner and co-creator of the quest Wei-Hwa Huang, there is a total of 12,358 puzzles.


The first class of puzzle is a board puzzle. A 4x4 board needs to be filled with symbols in such a way that all symbols in each row, each column and each colored area are different. The symbols given are 4 blades, 4 fleurs-de-lis, 4 omegas and 4 crosses. A few symbols are filled in from the beginning and can't be moved. [The image above shows a different puzzle than the one solved below.]


The puzzle I got was easy, but in anticipation of more challenging puzzles ahead, I decided to dust off my copy of The Reasoned Schemer by Daniel P. Friedman, William E. Byrd and Oleg Kiselyov. The book teaches logic programming as a natural extension of functional programming, and the paradigm of logic programming fits this puzzle perfectly. The book is very enjoyable, but be careful: it might hurt your weight.

The logic programming system used by The Reasoned Schemer is MiniKanren. It is a much simplified version of Kanren and runs in any R5RS Scheme. Without further ado, here is a MiniKanren program to solve the board puzzles from the Da Vinci Quest:

(define distinct
  (lambda (x1 x2 x3 x4)
    (let ([xs (list x1 x2 x3 x4)])
      (all
       (membero 'blade xs)
       (membero 'cross xs)
       (membero 'fleur xs)
       (membero 'omega xs)))))

> (run* (b)
    (fresh (x1  x2  x3  x4
            x5  x6  x7  x8
            x9  x10 x11 x12
            x13 x14 x15 x16)
      (alli
       ; the board consists of 4x4 fields
       (== b (list (list x1  x2  x3  x4)
                   (list x5  x6  x7  x8)
                   (list x9  x10 x11 x12)
                   (list x13 x14 x15 x16)))
       ; known symbols
       (== b (list (list x1  x2     x3  x4)
                   (list x5  x6     x7  'blade)
                   (list x9  'omega x11 x12)
                   (list x13 'fleur x15 'cross)))
       ; symbols in rows are distinct
       (distinct x1  x2  x3  x4)
       (distinct x5  x6  x7  x8)
       (distinct x9  x10 x11 x12)
       (distinct x13 x14 x15 x16)
       ; symbols in columns are distinct
       (distinct x1 x5 x9  x13)
       (distinct x2 x6 x10 x14)
       (distinct x3 x7 x11 x15)
       (distinct x4 x8 x12 x16)
       ; symbols in each colored area are distinct
       (distinct x3 x4  x7  x8)
       (distinct x9 x13 x14 x15))))

(((fleur blade cross omega)
  (omega cross fleur blade)
  (cross omega blade fleur)
  (blade fleur omega cross)))
#<procedure:succeed>

Tuesday, April 18, 2006

PLT Source Browser

The PLT Source Browser is now up to date.

Since the last update a number of interesting packages have appeared on PLaneT.

Douglas Williams released version 2.2 of his Science Collection. It now features a browsable HTML reference manual, loaded with examples and plots, for example a histogram of the results of 10,000 trials of summing two dice.

Noel Welsh released Instaweb, a small tool that makes developing and testing a single servlet in the PLT web server easier.

Dave Herman released csv-write to generate output in CSV format. As a spin-off he also released mysqldump, which dumps a MySQL database as a CSV file.

Carl Eastlund's combinator library was updated, as were Lizorkin's sxml library and Dignatof's Commander Swift aka cdrswift.

Finally Daniel Yoo released some syntactic sugar for Ruby/Python style generators.

Sunday, April 16, 2006

Index Construction

Update: The last part was rewritten.

Sort-based index construction

The implementation of the indexer has progressed better than I expected. As a warmup to implementing the "sort-based multiway compressed" algorithm, I have implemented both the "sort-based" and the "sort-based compressed" algorithms. They both work in four phases:
  1. For each term in each document a record (t, d, f) of the term number t, the document number d and the frequency f of term t in document d is written to a temporary file.

  2. The temporary file is divided into blocks, which are sorted in-memory.

  3. A series of merge runs is made. Each run merges two sorted blocks. The end result is a large file consisting of (t, d, f) triples sorted in lexicographic order.

  4. Based on the (t, d, f) triples the index is constructed. For each term number t an inverted list of the form (t n d1 f1 d2 f2 ... dn fn) is written to the index file. Here n is the number of documents in which the term occurs; the di and fi are the associated document numbers and frequencies.
During the tokenizing in phase 1 a lexicon is made. The lexicon associates terms with term numbers. Furthermore, during phase 4 the location of each term's inverted list in the index file is saved in the lexicon.

Searching for a term is simple: use the lexicon to look up the location of the term's inverted list in the index file, read the inverted list, and present the result. The frequencies are used to rank the documents by relevance.

To keep the size of the index and the temporary file down, the numbers aren't stored as standard 32-bit integers. Since an integer N requires only 1+floor(log(N)) bits, 32-bit integers waste a lot of bits on small numbers. Instead more flexible encodings are used, in which small numbers are represented in fewer bits than larger numbers [more about bit encodings in a later post].

Sort-based index construction with compression

A standard compression trick is to use gaps to store ascending lists. Say a term is found in documents (2 6 9 13); then the corresponding gaps are (2 4 3 4). The original document numbers are recovered by adding up the gaps, e.g. 9 = 2+4+3. The important thing to note is that the gaps are smaller than the original numbers and thus can be stored in fewer bits (if an appropriate encoding is chosen).
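As a small illustration, a sketch (not taken from the indexer itself) of converting an ascending list of document numbers to gaps and back:

(define (list->gaps ds)          ; (2 6 9 13) -> (2 4 3 4)
  (let loop ([prev 0] [ds ds])
    (if (null? ds)
        '()
        (cons (- (car ds) prev)
              (loop (car ds) (cdr ds))))))

(define (gaps->list gs)          ; (2 4 3 4) -> (2 6 9 13)
  (let loop ([sum 0] [gs gs])
    (if (null? gs)
        '()
        (let ([d (+ sum (car gs))])
          (cons d (loop d (cdr gs)))))))

> (list->gaps '(2 6 9 13))
(2 4 3 4)
> (gaps->list '(2 4 3 4))
(2 6 9 13)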

This observation can be used to store the inverted lists, since the inverted list of a term (t n d1 f1 d2 f2 ... dn fn) has ascending document numbers di. That is, changing the representation of an inverted list to (t n dg1 f1 dg2 f2 ... dgn fn), where dg stands for document gap, will reduce the size of the final index. The alert reader will note that the same trick applies to the term numbers in the final index. The extra alert reader will note that all terms are present in the final index and each term occurs only once, so we can omit the term number t from the final index completely.

Compressing the final index file has relatively little effect on the total run time of the indexer, because the bulk of the work is done in phase 3, during the external sorting of the temporary files consisting of the (t, d, f) triples. Compressing these temporary files would reduce I/O considerably and hence also the run time. Can we use the same trick again? We sure can: the temporary files consist of blocks of triples, and within each block the term numbers t occur in ascending order. The only complication of changing the representation from (t, d, f) to (tg, d, f), where tg is the term number gap, is some additional bookkeeping during merging.

The difference between the "sort-based" and the "sort-based compressed" algorithm is thus that the latter compresses the temporary files.

One last observation: By fusing phase 1 and 2 a pass through the temporary file is saved.


Thursday, April 06, 2006

PLaneT: combinators.ss

Carl Eastlund updated his library of combinator functions combinators.ss a few days ago.

It contains some classics such as curry and constant, but also includes an "applying" compose as well as map2. The map2 works the same way as map, but the mapper function must return two values.

The functions curry and constant are special cases of Sebastian Egner's cut from srfi-26. The names curry and constant are easier on the eyes though.
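For comparison, a sketch of the cut equivalent of the curry example below; the require path assumes PLT's bundled SRFI collection.

> (require (lib "26.ss" "srfi"))
> ((cut list 1 2 <...>) 3 4)
(1 2 3 4)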

> (require (planet "combinators.ss" ("cce" "combinators.plt" 1 1)))

> ((curry list 1 2) 3 4)
(1 2 3 4)

> ((constant 3))
3

> (define (q+r n)
    (list (quotient n 3)
          (remainder n 3)))
> ((compose/apply * q+r) 8)
4

> (map2 quotient/remainder
        (list 1 2 3 4 5)
        (list 3 3 3 3 3))
(0 0 1 1 1)
(1 2 0 1 2)

Wednesday, April 05, 2006

Tokenizing Scheme Source

To index a file of English text, the first thing you need to do is find all the words. Since I want to index Scheme source, my words are identifiers.

The default reader of MzScheme extends the syntax of R5RS identifiers, so it is necessary to read the fine print before one starts coding. Simplifying a little, one divides the characters into two classes: delimiters and non-delimiters. To find the next token, one simply keeps reading until a non-delimiter is seen. The token is then read character by character until a delimiter is seen. This simplistic approach would work were it not for identifiers beginning with hash-percent (#%), but it doesn't take much ingenuity to handle those too.

The easiest and most efficient way to implement the above approach in PLT Scheme is to use the built-in regular expressions. Other options were to use the more flexible Perl-style regular expressions or even to use parser-tools to generate a custom lexer. The built-in regular expressions have the advantage that they work directly on the input port, which makes them very fast. Since PLT Scheme uses the UTF-8 encoding to store source code, it is necessary to use byte regexps (as opposed to string-based ones). Another advantage of matching directly on the input port is that the port abstraction keeps track of location information such as line and column numbers automatically.

(define (utf-regexp str)
  (byte-regexp (string->bytes/utf-8 str)))

(define whitespace  "[ \n\r\t]")
(define brackets    "[]\\[\\(\\){}")
(define punctuation "[,'`;]")
(define special     "[\\|]")
(define delimeter   (string-append "(" whitespace "|" punctuation "|"
                                   brackets "|" special "|" "\"" ")"))
(define delimeter*  (string-append delimeter "*"))

(define non-symbol-starter-regexp (utf-regexp delimeter*))

(define (skip-to-next-token)
  (regexp-match non-symbol-starter-regexp (current-input-port))
  (when (eqv? (peek-char) #\#)
    (unless (equal? (peek-string 2 0) "#%")
      (read-char)
      (skip-to-next-token))))

(define non-delimeter  "[^]\\[\\(\\){} \n\r\t,'`;\\|\"]")
(define non-delimeter* (string-append non-delimeter "*"))
(define non-delimeter*-regexp (utf-regexp non-delimeter*))

(define (read-token)
  (let* ([pos (file-position (current-input-port))]
         [m   (regexp-match non-delimeter*-regexp (current-input-port))])
    (and m
         (match m
           [(token . _)
            (list token pos)]))))

(define (for-each-token f)
  (unless (eof-object? (peek-char))
    (skip-to-next-token)
    (unless (eof-object? (peek-char))
      (let ([token (read-token)])
        (if token
            (begin
              (f token)
              (for-each-token f))
            (error "internal error: token expected after skipping"))))))

(define (for-each-token-in-file file f)
  (with-input-from-file file
    (lambda ()
      (for-each-token f))))
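A hypothetical usage example (the file name is just a placeholder): print the position and text of every identifier in a file.

> (for-each-token-in-file "some-file.scm"
    (lambda (token+pos)
      (printf "~a ~a~n" (cadr token+pos) (car token+pos))))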

Paste Scheme now with colors

Paste Scheme now has colors. Actually I thought it always had, but I had forgotten to change the link to the stylesheet when I moved the servlet from my own machine to the web server. Since I have a test web server running on my own machine, everything looked okay here...

Tuesday, April 04, 2006

Back of the Envelope

Managing Gigabytes identify the following key parameters of a search engine:

Symbol  Parameter
B Total text size
N Number of documents
n Number of distinct words
F Total number of words
f Number of index pointers
I Final size of the compressed inverted file
L Size of dynamic lexicon structure

These parameters, together with estimates of disk seek time, disk transfer time per byte, etc., are used to estimate the resource usage of the search engine.

My initial plan is to index all source files of the PLT Source Browser (temporary url). In round numbers there are N=5200 files whose total size is B=40MB. Tokenizing the files reveals that the total number of words is F=3.7 million and that the number of distinct words is n=120,000.

Assuming one word entry takes up 10 bytes, 10F = 37MB is needed for the entire index. For such a small index the naïve approach to indexing (linked lists in memory) would be fine. I'll stick to my decision of using the sort-based compressed approach, though. It will be able to run on smaller machines with larger data sets, and the algorithms are more interesting.
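For reference, the back-of-the-envelope arithmetic as a REPL one-liner (the figures are the estimates above):

> (/ (* 10 3.7e6) 1e6)   ; 10 bytes per word entry times F words, in megabytes
37.0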

Sunday, April 02, 2006

Planet Scheme

Andreas Rottmann's aggregation of Scheme blogs has been dead for a while now. Mailing him to ask whether he plans to revive his Planet Scheme didn't yield a reply. As a regular reader of Planet Scheme, I decided to put together my own Planet Scheme, at least until Andreas puts his online again. The software was quite easy to set up. The only remaining issues are to find a logo and to figure out why the aggregator inserts extra newlines in my nicely colored Scheme snippets.

Scouring the net for Scheme related blogs I discovered that Will Farr already used Scheming as the title of his blog - so I changed the name of this blog to Everything Scheme.

Paste Scheme

[Note: This post shows up in Planet Scheme with extra line breaks. If you know how to fix this, please send me a mail. Editing the original to fix the line break issue unfortunately tricks the Planet software into thinking it is a new post.]



Blogging about Scheme leads to the desire to include snippets of Scheme source. Being spoiled by DrScheme's syntax highlighting, this means syntax-colored snippets. Luckily Dorai Sitaram wrote some syntax coloring code for SLaTeX. This code was adapted/rewritten by Anton van Straaten for the Scheme Cookbook. Using this code I have put together a little servlet, Paste Scheme, which lets you submit a Scheme snippet and returns XHTML ready to paste into your favorite blogging software.



The servlet source below was produced with Paste Scheme. The web.plt package hasn't been submitted to PLaneT yet, since there is no documentation yet except for the comments in the source.



;;; paste.ss  --  Jens Axel Soegaard

(module paste mzscheme
  (provide interface-version timeout start)

  (require (lib "match.ss")
           (lib "kw.ss")
           (planet "web.scm" ("soegaard" "web.plt"))
           (planet "html.scm" ("soegaard" "web.plt"))
           "scm2xexpr.scm")

  ;;;
  ;;; SERVLET INTERFACE
  ;;;

  (define interface-version 'v1)
  (define timeout 6000)

  (define start
    ; servlet sets up the various parameters such as
    ; current-bindings and current-cookies, evaluates
    ; the body expressions, the last of which should
    ; evaluate to an xexpr.
    (servlet
     (report-errors-to-browser send/finish)
     (html-paste-page)))

  ;;;
  ;;; VIEW
  ;;;

  (define default-snippet "
(define (fact n)
  (if (= n 0)
      1
      (* n (fact (- n 1)))))")

  (define (html-paste-page)
    (with-binding (current-bindings) (snippet)
      (let ([snippet (if snippet snippet default-snippet)])
        (html-page
         #:title       "Paste Scheme"
         #:style-sheet "http://localhost:8080/paste.css"
         #:header      '(h1 "Scheme Paste")
         #:body `(div (h2 "Enter Snippet")
                      ,(html-form
                        "submit_snippet" "paste.ss"
                        `(textarea ((name "snippet") (cols "80") (rows "10"))
                                   ,snippet)
                        '(br)
                        (html-input "submit" #:value "submit"))
                      (h2 "Preview")
                      ,(scheme-text->xexpr snippet)
                      (h2 "XHTML")
                      (pre
                       ,(xexpr->string
                         (scheme-text->xexpr snippet)))
                      (h2 "Stylesheet")
                      (pre ,"
.scheme           { color: brown; margin: 4pt; } /* background punctuation */
.scheme .keyword  { color: rgb(68,0,203); font-weight: bold; }
.scheme .builtin  { color: navy; }
.scheme .variable { color: black; }
.scheme .global   { color: purple; }
.scheme .selfeval { color: green; }
.scheme .comment  { color: teal; }
"))))))

  )