A Comment on Dimension-Estimation

I saw this neat comment in a paper I was recently reading. If you have all i.i.d features and you want to estimate its dimension using Grassberger-Procaccia (which estimates dimension using a distance-based metric) or want to classify using a k-NN classifier, it is bad if the data points are mostly pairwise equidistant (for instance, a correlation integral plot will look like a step function and thus will be useless; a k-NN classifier will break because the test point ends up equidistant from all the existing points).

There is a trivial argument using the Hoeffding bound in Chris Burges’ paper that suggests that if the features are all i.i.d, a majority of pairwise distances will end up clustered tightly around a mean which means that k-NN or Grassberger-Procaccia won’t work well. I am going to repeat this argument here so I can remember it for later:

Our vectors are of dimension $ d $ and the components are $ \pm1 $. Assuming all the components are $ iid $, the Hoeffding bound gives us:

$$ P(||| x_{1} - x_{2} ||^{2} – 2d| > d\epsilon) = P(| x_{1} \cdot x_{2} | > d\epsilon/2) \le 2exp(-\frac{d\epsilon^2}{8})$$

and this shows us that most pairwise distances will end up clustered very tightly around a mean and this means that a majority of pairs of points in the dataset will end up equidistant and thus a $ k-NN $ classifier will fail.

This also means that the correlation integral is a good way to determine if a k-NN classifier will work well. If the plot resembles a spike, the distance function needs to change.

The correlation-integral is an immensely powerful tool and here’s an implementation


The Story of Phineas Gage

I first encountered the story of Phineas Gage in my freshman year in college. Phineas had a projectile drill a hole in his head (while he worked the railroad) and he survived the incident. The incident caused a massive personality change in Phineas and conclusively ruled out phrenology as a reasonable model of how the brain worked.

After the incident Phineas’ life comprised a trip to South America, a gig at P.T. Barnum’s establishment and a massive change in our understanding of the brain. This wikipedia article covers a good chunk of this work.


Obituary

Earlier this month, my paternal grandfather passed away. I didn’t speak to him before his death since I expected him to recover from his current bout of ill-health and I have since carried some guilt thanks to this.

I think this excerpt from a Pink Floyd song best summarizes his life:

Run, run rabbit run
Dig that hole, forget the sun
And when at last the work is done
Don’t sit down, it’s time to dig another one

-- [Breathe, Pink Floyd](http://rock.rapgenius.com/Pink-floyd-breathe-lyrics)

RIP.


Web Crawling - Dos and Don’ts

For my SIGIR submission I have been working on finding efficient traversal strategies while crawling websites.

Web crawling is a straightforward graph-traversal problem. My research focuses on discarding unproductive paths and preserving bandwidth to find more information. I will write a post on it once I have my ideas fleshed out and thus that won’t be the focus of this post.

Here, I will describe the finer details needed to make your crawler polite and robust. An impolite crawler will incur the wrath of an admin and might get you banned. A crawler that isn’t robust cannot survive the onslaught of quirks that the WWW is full of.


Zephyros Racket API

In the recent past, I wanted to control the OS X window manager from racket like I could on Linux using the X11 library. I found a very sweet Github project called zephyros that implemented a large number of vital routines (vital for managing windows anyway) and provided a simple protocol using json. Since it would be convenient to have a racket module, I wrote a wrapper around it.


The Percolator Paper

In the IR reading group this week I decided to read the Percolator paper from Google[1]. It caused quite a stir on several news-reading sites after a Google Research blog-post on the topic. Since I’ve never had the chance to read it, this is as good a time as any. This is not a comprehensive summary at all and lots of results here are hand-wavy. If you want to instruct yourself, please read the paper.


Pittsburgh Vintage Grand Prix - Italian Cars

In the Pittsburgh Vintage Grand Prix [1], Ferrari had a large exhibit in celebration of their 50th year in America and a few Lamborghinis, Alfas and Maseratis showed up. Since I am not likely to ever own a Ferrari, I behaved like a tourist and took a few pictures. Some of these vehicles were incredibly well maintained. You can see the entire album by following the link below

[1] The Pittsburgh Vintage Grand Prix go back


Accessing Your Kindle Highlights

In 2010, I purchased my first Kindle and since then apart from GEB [1], I haven’t bothered with physical copies. The Kindle store satisfies most of my needs (I find situations where the paperback costs less than the digital copy and refuse to buy the book on principle).

The books can be read on any platform (OS X, iOS for iPad and iPhone in my case and I do remember a rather unpleasant Kindle app on WP7)

One of the benefits of a digital book is that it should be straightforward for me to collect a list of highlights I’ve made about the book. Amazon (in their infinite wisdom) have not provided an API in the 3 or so years I’ve used the Kindle ecosystem and manually transcribing the quotes is not something I am interested in doing. Scraping remains the only alternative. I decided to use clojure for this task.



Per Intellectum, Vis
(c) Shriphani Palakodety 2013-2016