Web Crawling - Dos and Don’ts

For my SIGIR submission I have been working on finding efficient traversal strategies while crawling websites.

Web crawling is a straightforward graph-traversal problem. My research focuses on discarding unproductive paths and preserving bandwidth to find more information. I will write a separate post on that once I have my ideas fleshed out, so it won’t be the focus of this one.

Here, I will describe the finer details needed to make your crawler polite and robust. An impolite crawler will incur the wrath of an admin and might get you banned. A crawler that isn’t robust cannot survive the onslaught of quirks that the WWW is full of.
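As a rough illustration of what "polite" can mean in practice, here is a minimal Python sketch that checks robots.txt rules and enforces a per-host delay between requests. The class name, the `examplebot` user agent and the delay value are all made up for this example; the only library calls are from Python's standard `urllib`.

```python
import time
from urllib import robotparser
from urllib.parse import urlparse


class PolitenessGate:
    """Illustrative helper: robots.txt rules plus a per-host minimum delay."""

    def __init__(self, user_agent, robots_lines, min_delay=1.0):
        self.user_agent = user_agent
        self.min_delay = min_delay  # seconds between hits on one host (assumed value)
        self.rules = robotparser.RobotFileParser()
        self.rules.parse(robots_lines)  # robots.txt content, already split into lines
        self.last_hit = {}  # host -> timestamp of the most recent request

    def allowed(self, url):
        """True if robots.txt permits this user agent to fetch the URL."""
        return self.rules.can_fetch(self.user_agent, url)

    def wait_time(self, url, now=None):
        """Seconds to wait before the host of this URL may be hit again."""
        now = time.monotonic() if now is None else now
        last = self.last_hit.get(urlparse(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (now - last))

    def record(self, url, now=None):
        """Remember that the host of this URL was just requested."""
        now = time.monotonic() if now is None else now
        self.last_hit[urlparse(url).netloc] = now


gate = PolitenessGate(
    "examplebot",
    ["User-agent: *", "Disallow: /private/"],
    min_delay=2.0,
)
```

A crawl loop would call `allowed` before queueing a URL, sleep for `wait_time` before fetching, and call `record` right after the request goes out.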

Zephyros Racket API

Recently, I wanted to control the OS X window manager from Racket, like I could on Linux using the X11 library. I found a very sweet GitHub project called zephyros that implemented a large number of vital routines (vital for managing windows anyway) and provided a simple protocol using JSON. Since it would be convenient to have a Racket module, I wrote a wrapper around it.

The Percolator Paper

In the IR reading group this week I decided to read the Percolator paper from Google [1]. It caused quite a stir on several news-reading sites after a Google Research blog post on the topic. Since I’ve never had the chance to read it, this is as good a time as any. This is not a comprehensive summary at all and lots of results here are hand-wavy. If you want the full details, please read the paper.

Pittsburgh Vintage Grand Prix - Italian Cars

At the Pittsburgh Vintage Grand Prix [1], Ferrari had a large exhibit in celebration of their 50th year in America, and a few Lamborghinis, Alfas and Maseratis showed up. Since I am not likely to ever own a Ferrari, I behaved like a tourist and took a few pictures. Some of these vehicles were incredibly well maintained. You can see the entire album by following the link below.

[1] The Pittsburgh Vintage Grand Prix

Accessing Your Kindle Highlights

In 2010, I purchased my first Kindle and since then, apart from GEB [1], I haven’t bothered with physical copies. The Kindle store satisfies most of my needs (though I do find situations where the paperback costs less than the digital copy, and then I refuse to buy the book on principle).

The books can be read on any platform (OS X and iOS on iPad and iPhone in my case, and I do remember a rather unpleasant Kindle app on WP7).

One of the benefits of a digital book is that it should be straightforward for me to collect a list of the highlights I’ve made in the book. Amazon (in their infinite wisdom) have not provided an API in the 3 or so years I’ve used the Kindle ecosystem, and manually transcribing the quotes is not something I am interested in doing. Scraping remains the only alternative. I decided to use Clojure for this task.

Fast dates parser

The ClueWeb12++ crawl aims at accumulating social media content from the ClueWeb crawl’s time frame. Our pipeline thus far was as follows:

  1. Download a bunch of index pages from forums (index pages link to threads).
  2. Identify posts that fall in the time-frame specified.
  3. Download posts and recreate the web graph to give the impression of a crawl completed in the 2012 time-frame.

There is one complicated step in this setup: step 2. Date processing is a nuisance that I would not wish upon anyone else. There are innumerable surface representations (some of them ambiguous) and, to add to our troubles, people do things like use “Last Week” to indicate the time of activity.

The most accurate tool is SUTime, but on a crawl the size of ClueWeb it is foolish to run it on every page. So what we do is use Natty. Natty is fast and reasonably accurate.
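To make the “Last Week” problem concrete, here is a toy Python sketch of step 2. It is neither SUTime nor Natty, just a stand-in that resolves a handful of relative phrases against the time a page was fetched and then checks the result against a time frame. The phrase table, the list of formats and the window boundaries are all assumptions made for this example.

```python
from datetime import datetime, timedelta

# Hypothetical lookup table: relative phrases mapped to an offset from the fetch time.
RELATIVE_OFFSETS = {
    "today": timedelta(days=0),
    "yesterday": timedelta(days=1),
    "last week": timedelta(weeks=1),
}

# A few absolute layouts; real forums use many more, and some are ambiguous.
ABSOLUTE_FORMATS = ("%Y-%m-%d", "%d %B %Y", "%B %d, %Y")


def resolve_date(text, fetched_at):
    """Return a datetime for the given surface form, or None if unrecognised."""
    cleaned = text.strip()
    offset = RELATIVE_OFFSETS.get(cleaned.lower())
    if offset is not None:
        # Relative phrases only make sense relative to when the page was fetched.
        return fetched_at - offset
    for fmt in ABSOLUTE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    return None


def in_time_frame(text, fetched_at, start, end):
    """True if the date resolves and falls inside [start, end]."""
    resolved = resolve_date(text, fetched_at)
    return resolved is not None and start <= resolved <= end


# Assumed window covering the whole of 2012, and an assumed fetch time.
start = datetime(2012, 1, 1)
end = datetime(2012, 12, 31, 23, 59, 59)
fetched = datetime(2013, 1, 3)
```

Note how the same phrase can land inside or outside the window depending on when the page was fetched, which is why the fetch time has to travel with every extracted date.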

I’ve uploaded a Java module to GitHub that will spit out a list of dates. You can obtain it here.

New Blog

This is my new blogging spot. The plan is to stop blogging from a WYSIWYG editor in the browser, which isn’t really suited to the kind of blogging I want to do, and to move to a static generator.

Per Intellectum, Vis
(c) Shriphani Palakodety 2013-2016