For my SIGIR submission I have been working on finding efficient traversal strategies while crawling websites.
Web crawling is a straightforward graph-traversal problem. My research focuses on discarding unproductive paths and preserving bandwidth to find more information. I will write a post on it once I have my ideas fleshed out, so that won’t be the focus of this post.
Here, I will describe the finer details needed to make your crawler polite and robust. An impolite crawler will incur the wrath of an admin and might get you banned. A crawler that isn’t robust cannot survive the onslaught of quirks that the WWW is full of.
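To make "polite" concrete, here is a minimal sketch in Python of two common courtesies: honouring robots.txt and spacing out requests to the same host. The class name `PoliteGate`, the user agent string and the 5 second delay are all assumptions made for illustration, not details taken from the post.

```python
import urllib.robotparser
from urllib.parse import urlsplit


class PoliteGate:
    """Decides whether a URL may be fetched now (hypothetical helper)."""

    def __init__(self, user_agent, min_delay_seconds=5.0):
        self.user_agent = user_agent
        self.min_delay = min_delay_seconds  # assumed value, tune per site
        self.rules = {}       # host -> parsed robots.txt rules
        self.last_fetch = {}  # host -> timestamp of the last fetch

    def load_robots(self, host, robots_txt):
        """Parse a robots.txt body that was already downloaded for host."""
        parser = urllib.robotparser.RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.rules[host] = parser

    def seconds_to_wait(self, url, now):
        """Return None if robots.txt forbids url, else the delay in seconds."""
        host = urlsplit(url).netloc
        parser = self.rules.get(host)
        if parser is not None and not parser.can_fetch(self.user_agent, url):
            return None
        last = self.last_fetch.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (now - last))

    def record_fetch(self, url, now):
        """Remember when the host was last contacted."""
        self.last_fetch[urlsplit(url).netloc] = now
```

The clock is passed in as `now` so the logic can be exercised without sleeping; a real crawler would likely pass `time.monotonic()` and sleep for the returned number of seconds before fetching.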
In the recent past, I wanted to control the OS X window manager from Racket as I could on Linux using the X11 library. I found a very sweet GitHub project called zephyros that implemented a large number of vital routines (vital for managing windows anyway) and provided a simple protocol using JSON. Since it would be convenient to have a Racket module, I wrote a wrapper around it.
Whistlepig is a lightweight real-time search engine written in ANSI C. (description and source) I heard about it when Don Metzler plugged it in an answer he wrote on Quora. In this post, with very little code, I build an index, query it and write a servlet that talks to the index using the FFI.
I recently gave a talk at CMU on the state of the Clueweb12++ crawl. Here are the slides.
In the IR reading group this week I decided to read the Percolator paper from Google. It caused quite a stir on several news-reading sites after a Google Research blog-post on the topic. Since I’ve never had the chance to read it, this is as good a time as any. This is not a comprehensive summary at all and lots of results here are hand-wavy. If you want to instruct yourself, please read the paper.
At the Pittsburgh Vintage Grand Prix, Ferrari had a large exhibit in celebration of their 50th year in America, and a few Lamborghinis, Alfas and Maseratis showed up. Since I am not likely to ever own a Ferrari, I behaved like a tourist and took a few pictures. Some of these vehicles were incredibly well maintained. You can see the entire album by following the link below.
The Pittsburgh Vintage Grand Prix
In 2010, I purchased my first Kindle and since then, apart from GEB, I haven’t bothered with physical copies. The Kindle store satisfies most of my needs (though I do find situations where the paperback costs less than the digital copy, and then I refuse to buy the book on principle).
The books can be read on any platform (OS X and iOS on the iPad and iPhone in my case, and I do remember a rather unpleasant Kindle app on WP7).
One of the benefits of a digital book is that it should be straightforward for me to collect a list of the highlights I’ve made in it. Amazon (in their infinite wisdom) have not provided an API in the 3 or so years I’ve used the Kindle ecosystem, and manually transcribing the quotes is not something I am interested in doing. Scraping remains the only alternative. I decided to use Clojure for this task.
The Clueweb12++ crawl aims at accumulating social media content from the Clueweb crawl’s time frame. Our pipeline thus far was as follows:
- Download a bunch of index pages from forums (index pages link to threads).
- Identify posts that fall in the time-frame specified.
- Download posts and recreate web-graph to give the impression of a crawl completed in the 2012 time-frame.
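The time-frame filter in the pipeline above can be sketched as follows, in Python. The window boundaries are placeholders (the post only says "2012 time-frame"), and `threads_in_window` is a hypothetical name; the sketch also assumes each thread already carries a parsed timestamp.

```python
from datetime import datetime

# Placeholder window: the exact crawl dates are not given in the post.
WINDOW_START = datetime(2012, 1, 1)
WINDOW_END = datetime(2013, 1, 1)  # exclusive upper bound


def threads_in_window(threads, start=WINDOW_START, end=WINDOW_END):
    """Keep (url, posted_at) pairs whose timestamp falls in [start, end)."""
    return [(url, posted_at) for url, posted_at in threads
            if start <= posted_at < end]
```

Only the threads that survive this filter would then be downloaded in the final step.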
There is one complicated step in this setup: step 2. Date processing is a nuisance that I would not wish upon anyone else. There are innumerable surface representations (which can be ambiguous) and, to add to our troubles, people do things like use “Last Week” to indicate the time of activity.
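A toy illustration of both problems, in Python: a relative phrase only means something once it is anchored to the time the page was fetched, and a numeric date can parse to two different days depending on the assumed field order. The function `resolve_date`, its lookup table and the two formats are assumptions made for this sketch; it is nowhere near what a real date extractor does, and mapping "last week" to exactly seven days earlier is itself a simplification.

```python
from datetime import datetime, timedelta

# Hypothetical table of relative phrases and how far back they point.
RELATIVE_OFFSETS = {
    "today": timedelta(days=0),
    "yesterday": timedelta(days=1),
    "last week": timedelta(weeks=1),
}


def resolve_date(text, fetched_at, day_first=False):
    """Resolve a date string to a calendar date, or None if unrecognised."""
    key = text.strip().lower()
    if key in RELATIVE_OFFSETS:
        # Relative phrases are anchored to when the page was fetched.
        return (fetched_at - RELATIVE_OFFSETS[key]).date()
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    try:
        return datetime.strptime(key, fmt).date()
    except ValueError:
        return None
```

With these assumptions, "03/04/2012" resolves to 4 March when read month-first and to 3 April when read day-first, which is exactly the kind of ambiguity that has to be settled per forum.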
The most accurate tool is SUTime, but it is foolish to run it on a crawl the size of ClueWeb. So we use Natty instead, which is fast and reasonably accurate.
I’ve uploaded a Java module to GitHub that will spit out a list of dates. You can obtain it here.
This is my new blogging spot. The plan is to stop blogging from a WYSIWYG editor in the browser, which isn’t really suited to the kind of blogging I want to do, and to move to a static generator.