For the IR reading group this week, I decided to read the Percolator paper from Google. It caused quite a stir on several news-reading sites after a Google Research blog post on the topic. Since I’ve never had the chance to read it, this is as good a time as any. This is not a comprehensive summary at all, and lots of the results here are hand-wavy. If you want the details, please read the paper.
At the Pittsburgh Vintage Grand Prix, Ferrari had a large exhibit in celebration of their 50th year in America, and a few Lamborghinis, Alfas and Maseratis showed up. Since I am not likely to ever own a Ferrari, I behaved like a tourist and took a few pictures. Some of these vehicles were incredibly well maintained. You can see the entire album by following the link below.
In 2010, I purchased my first Kindle, and since then, apart from GEB, I haven’t bothered with physical copies. The Kindle store satisfies most of my needs (though I do run into situations where the paperback costs less than the digital copy, and then I refuse to buy the book on principle).
The books can be read on any platform (OS X and iOS on the iPad and iPhone in my case, and I do remember a rather unpleasant Kindle app on WP7).
One of the benefits of a digital book is that it should be straightforward for me to collect a list of the highlights I’ve made in the book. Amazon (in their infinite wisdom) have not provided an API in the 3 or so years I’ve used the Kindle ecosystem, and manually transcribing the quotes is not something I am interested in doing. Scraping remains the only alternative. I decided to use Clojure for this task.
The ClueWeb12++ crawl aims at accumulating social media content from the ClueWeb crawl’s time frame. Our pipeline thus far was as follows:
- Download a bunch of index pages from forums (index pages link to threads).
- Identify posts that fall in the time-frame specified.
- Download the posts and recreate the web graph to give the impression of a crawl completed in the 2012 time frame.
There is one complicated step in this setup: step 2. Date processing is a nuisance that I would not wish upon anyone else. There are innumerable surface representations (which can be ambiguous) and, to add to our troubles, people do things like use “Last Week” to indicate the time of activity.
The most accurate tool is SUTime, but it would be foolish to run it over a crawl the size of ClueWeb. So we use Natty instead. Natty is fast and reasonably accurate.
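To make the date problem in step 2 a bit more concrete, here is a rough Python sketch of the kind of check involved. It is not Natty and it is not the module mentioned in this post; it only handles a handful of formats. The time-frame bounds, the format list and the names `parse_post_date` and `in_time_frame` are all assumptions made up for illustration.

```python
from datetime import date, datetime, timedelta

# Hypothetical bounds: the exact crawl time frame is not given in the post.
FRAME_START = date(2012, 2, 1)
FRAME_END = date(2012, 5, 31)

# A few absolute surface forms, tried in order. An ambiguous string such as
# "03/04/2012" is resolved by whichever format matches first (month/day here).
FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %B %Y", "%B %d, %Y")

# Relative expressions mapped to a rough number of days in the past.
RELATIVE_DAYS = {"today": 0, "yesterday": 1, "last week": 7}


def parse_post_date(text, fetched_on):
    """Return a date for the post timestamp, or None if it is not understood.

    Relative expressions are resolved against fetched_on, the day the page
    was downloaded, since they have no meaning on their own.
    """
    cleaned = " ".join(text.split())
    key = cleaned.lower()
    if key in RELATIVE_DAYS:
        return fetched_on - timedelta(days=RELATIVE_DAYS[key])
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue  # try the next surface form
    return None


def in_time_frame(text, fetched_on):
    """True if the timestamp parses and falls inside the assumed time frame."""
    parsed = parse_post_date(text, fetched_on)
    return parsed is not None and FRAME_START <= parsed <= FRAME_END
```

Note that a relative expression like “Last Week” can only be resolved if the download date of the page is kept around, which is why `fetched_on` is passed in explicitly. A real parser has to cover far more forms than this, which is the reason for reaching for a library in the first place.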
I’ve uploaded a Java module to GitHub that will spit out a list of dates. You can obtain it here.
This is my new blogging spot. The plan is to stop blogging from a WYSIWYG editor in the browser, which isn’t really suited to the kind of blogging I want to do, and to move to a static generator.