Whistlepig is a lightweight real-time search engine written in ANSI C. (description and source) I heard about it when Don Metzler plugged it in an answer he wrote on Quora. In this post, with very little code, I build an index, query it, and write a servlet that talks to the index through the FFI.
Whistlepig is implemented in less than 3000 lines of ANSI C. It comes with a very decent query language and returns documents (that match a query) sorted by their time-of-insertion (into the index).
To use the FFI, we need a shared library. I had to modify the Makefile slightly to make it compile on OS X (you can download this from my fork here), though it built with ease on Linux. You can build the .so (.dylib) file using the following command:
On OS X, you will need to replace the .so with .dylib.
For the generated libwhistlepig.so file to be picked up by the Racket FFI, you will need to add the directory where it resides to the environment variable LD_LIBRARY_PATH.
Next, we need to use the FFI to write Racket functions that call the corresponding Whistlepig routines. This file is sufficient to wrap all the routines we will need to use.
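For a sense of the pattern such a wrapper follows, here is a minimal sketch rather than the actual bindings. It first tries to open libwhistlepig by name, which only succeeds if the library is on the search path, and then shows the define-ffi-definer binding style against libc's abs so that the example runs without Whistlepig installed. The names whistlepig-lib, define-libc and c-abs are made up for illustration; the real Whistlepig routine names and signatures come from its headers and are not reproduced here.

```racket
#lang racket/base
(require ffi/unsafe
         ffi/unsafe/define)

;; Try to open the shared library by name. #:fail lets us get #f back
;; instead of an exception when the library is not on the search path.
(define whistlepig-lib
  (ffi-lib "libwhistlepig" #:fail (lambda () #f)))

(unless whistlepig-lib
  (eprintf "libwhistlepig not found; check LD_LIBRARY_PATH\n"))

;; The same binding pattern, shown against libc so it runs anywhere:
;; (ffi-lib #f) refers to the libraries already loaded into the process.
(define-ffi-definer define-libc (ffi-lib #f))

;; int abs(int) from the C standard library, bound under the name c-abs.
(define-libc c-abs (_fun _int -> _int) #:c-id abs)

(c-abs -42)
```

A Whistlepig wrapper would presumably use the same two steps: one define-ffi-definer for the library, then one binding per C routine with a matching _fun type.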
Let us now test our implementation. Whistlepig itself ships with two programs: add and query. add adds a bunch of files specified on the command line to a new index. We can replicate the functionality in add.rkt:
interactive.rkt takes an index location, interactively runs queries against it, and returns doc-ids. This is interactive.rkt.
The next step is straightforward. I wanted a quick way to get set up and running. Whistlepig has nothing like Lucene's stored fields, so we need an external map keyed by doc-id. I am using a file that contains a list of s-expressions that look like this:
(doc-id doc-path doc-title doc-link)
On my laptop (not accessible from the public web), the file looks like:
So, our search engine will start, load the documents and add them to the index in the order specified.
When a query comes along, we get the doc-ids from the engine and line them up with the document title (and thus we get the doc-link that can be rendered).
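The lookup step can be sketched as follows. This is an illustration under assumptions, not the code from the repository: it assumes each record has exactly the shape (doc-id doc-path doc-title doc-link), and the names read-doc-table, doc-ids->results and the sample records are hypothetical.

```racket
#lang racket/base

;; Read every s-expression from an input port into a hash keyed by doc-id.
;; Each record is assumed to be (doc-id doc-path doc-title doc-link).
(define (read-doc-table in)
  (for/hash ([rec (in-port read in)])
    (values (car rec) (cdr rec))))

;; Turn a list of doc-ids into (title . link) pairs, keeping the order
;; of the ids and skipping any id that is not in the table.
(define (doc-ids->results table ids)
  (for/list ([id (in-list ids)]
             #:when (hash-has-key? table id))
    (let ([rec (hash-ref table id)])   ; rec is (doc-path doc-title doc-link)
      (cons (cadr rec) (caddr rec)))))

;; Made-up sample data standing in for the real file.
(define sample
  "(1 \"posts/a.html\" \"First post\" \"/a\")
   (2 \"posts/b.html\" \"Second post\" \"/b\")")

(define table (read-doc-table (open-input-string sample)))

(doc-ids->results table '(2 1 7))
```

In a real setup the port would come from the metadata file (for example via call-with-input-file), and the list of ids would be whatever the engine returns for a query.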
We accept queries using a URL of the form http://domain/search?q=query. This tiny servlet accomplishes that:
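As a rough sketch of what a servlet of this shape can look like with the Racket web server (not necessarily the code used on the blog): it reads the q parameter, hands it to a search function, and renders the results as a list of links. Here search-index is a hypothetical stub standing in for the Whistlepig query plus the doc-id lookup, and the port and the run-server helper are assumptions.

```racket
#lang racket/base
(require web-server/servlet
         web-server/servlet-env)

;; Hypothetical stand-in for the index lookup: takes a query string and
;; returns a list of (title . link) pairs.
(define (search-index query)
  (list (cons (string-append "Result for " query) "/example")))

;; Build the results page as an x-expression.
(define (results-page query)
  `(html
    (body
     (h1 ,(string-append "Results for: " query))
     (ul ,@(for/list ([r (in-list (search-index query))])
             `(li (a ((href ,(cdr r))) ,(car r))))))))

;; Servlet entry point: read q from the query string, defaulting to "".
(define (start req)
  (define bindings (request-bindings req))
  (define query
    (if (exists-binding? 'q bindings)
        (extract-binding/single 'q bindings)
        ""))
  (response/xexpr (results-page query)))

;; Call (run-server) to serve http://localhost:8000/search?q=hello
(define (run-server)
  (serve/servlet start
                 #:servlet-path "/search"
                 #:port 8000
                 #:launch-browser? #f))
```

Keeping the page construction in results-page, separate from start, makes it possible to check the rendered output without starting a server.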
You can try it out here: http://blog.shriphani.com/search?q=hello. It is very clunky and only indexes my blog’s post pages. It won’t generate snippets (the postings list doesn’t store token positions).
To make this really work, I will need to integrate it with Frog and flesh out the UI. This was something I threw together quickly so I could build on it later. It ended up being a good excuse to continue using the FFI. The Whistlepig bindings are quite sparse too and could use some work.
The full code is available in this Github repository.