A Comment on Dimension-Estimation
SHRIPHANI PALAKODETY, 2013-11-30 (tagged 'k-nn-classifier')
<html>
<p>I saw this neat comment in a paper I was reading recently. Suppose your features are all <code>i.i.d</code> and you want to estimate the dimension of the data using Grassberger-Procaccia (which estimates dimension from pairwise distances), or you want to classify using a k-NN classifier. Both work poorly when the data points are mostly pairwise equidistant: the correlation integral plot looks like a step function and is thus useless, and a k-NN classifier breaks because the test point ends up nearly equidistant from all the existing points.</p>
<p>There is a short argument using the Hoeffding bound in Chris Burges’ <a href="http://research.microsoft.com/en-us/um/people/cburges/tech_reports/msr-tr-2009-2013.pdf">paper</a> showing that if the features are all <code>i.i.d</code>, a majority of pairwise distances end up clustered tightly around a mean, which means that k-NN or Grassberger-Procaccia won’t work well. I am going to repeat this argument here so I can remember it for later:</p>
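<p>Our vectors are of dimension $ d $ and the components are $ \pm1 $, so every vector has squared norm $ d $ and $ \| x_{1} - x_{2} \|^{2} = 2d - 2\, x_{1} \cdot x_{2} $. Assuming all the components are <code>i.i.d</code> with mean zero (each sign equally likely), $ x_{1} \cdot x_{2} $ is a sum of $ d $ independent $ \pm1 $ terms, and the Hoeffding bound gives us:</p>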
<p>$$ P\left(\left|\, \| x_{1} - x_{2} \|^{2} - 2d \,\right| > d\epsilon\right) = P\left(| x_{1} \cdot x_{2} | > d\epsilon/2\right) \le 2\exp\left(-\frac{d\epsilon^2}{8}\right) $$</p>
<p>This shows that, for large $ d $, most pairwise squared distances cluster very tightly around their mean $ 2d $. A majority of pairs of points in the dataset therefore end up nearly equidistant, and a k-NN classifier will fail.</p>
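<p>The concentration can be checked numerically. The following is a minimal sketch, not taken from the paper: it assumes each component is $ +1 $ or $ -1 $ with equal probability, and the function names, the choice $ \epsilon = 0.5 $ and the sample sizes are all illustrative. It compares the empirical fraction of pairs whose squared distance deviates from $ 2d $ by more than $ d\epsilon $ with the Hoeffding bound.</p>

```python
import math
import random


def sample_sign_vector(d, rng):
    """Draw a vector in {-1, +1}^d with independent, equally likely signs."""
    return [rng.choice((-1, 1)) for _ in range(d)]


def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def deviation_fraction(d, eps, n_pairs=500, seed=0):
    """Fraction of sampled pairs with | ||x1 - x2||^2 - 2d | > d * eps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_pairs):
        x1 = sample_sign_vector(d, rng)
        x2 = sample_sign_vector(d, rng)
        if abs(squared_distance(x1, x2) - 2 * d) > d * eps:
            hits += 1
    return hits / n_pairs


def hoeffding_bound(d, eps):
    """Right-hand side 2 * exp(-d * eps^2 / 8), capped at 1 (it is a probability)."""
    return min(1.0, 2.0 * math.exp(-d * eps ** 2 / 8.0))


if __name__ == "__main__":
    eps = 0.5  # illustrative tolerance
    for d in (10, 100, 1000):
        print(d, deviation_fraction(d, eps), hoeffding_bound(d, eps))
```

<p>For small $ d $ the bound is vacuous (it exceeds 1), while for large $ d $ both the bound and the observed fraction should shrink quickly towards zero.</p>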
<p>This also means that the correlation integral is a good way to determine if a k-NN classifier will work well. If the plot resembles a step function (that is, the pairwise distances form a sharp spike around a single value), the distance function needs to change.</p>
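<p>As a rough illustration of this check, here is a minimal Python sketch. It is not the Clojure implementation linked in this post. It takes the correlation integral $ C(r) $ to be the fraction of point pairs whose Euclidean distance is at most $ r $, and <code>relative_transition_width</code> is a hypothetical diagnostic of my own choosing: the width of the interval over which $ C(r) $ rises from 0.1 to 0.9, divided by the median pairwise distance. The thresholds and the sample data are assumptions.</p>

```python
import bisect
import math
import random


def pairwise_distances(points):
    """Euclidean distances over all unordered pairs of points."""
    return [
        math.dist(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    ]


def correlation_integral(points, radii):
    """C(r): fraction of pairs whose distance is at most r, for each r in radii."""
    dists = sorted(pairwise_distances(points))
    return [bisect.bisect_right(dists, r) / len(dists) for r in radii]


def relative_transition_width(points, lo=0.1, hi=0.9):
    """Hypothetical diagnostic: width of the r-interval over which C(r)
    rises from `lo` to `hi`, divided by the median pairwise distance.
    Values near 0 mean C(r) looks like a step function."""
    dists = sorted(pairwise_distances(points))

    def quantile(p):
        return dists[int(p * (len(dists) - 1))]

    return (quantile(hi) - quantile(lo)) / quantile(0.5)


if __name__ == "__main__":
    rng = random.Random(0)
    # i.i.d. +/-1 features in 500 dimensions: distances concentrate.
    iid_signs = [[rng.choice((-1, 1)) for _ in range(500)] for _ in range(60)]
    # Points spread uniformly over the unit square: distances vary widely.
    square = [[rng.random(), rng.random()] for _ in range(60)]
    print(relative_transition_width(iid_signs))
    print(relative_transition_width(square))
```

<p>A width close to zero for the i.i.d. data, against a much larger one for the points in the square, is the numerical counterpart of a step-shaped plot.</p>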
<p>The correlation integral is an immensely powerful tool, and <a href="https://github.com/shriphani/clj-dimension/blob/master/src/clj_dimension/estimation/correlation_integral.clj">here’s</a> an implementation.</p></html>