## Dimension Analysis: A Recap

In the past few blog posts, I covered some details of popular dimension-reduction techniques and showed some common themes. In this post, I will collect all these materials and tie them together.

In the past few blog posts, I covered some details of popular dimension-reduction techniques and showed some common themes. In this post, I will collect all these materials and tie them together.

*This is part of a series on a family of dimension-reduction algorithms called non-linear dimension reduction. The goal here is to reduce the dimensions of a dataset (i.e. discard some columns in your data)*

In previous posts, I discussed the MDS algorithm and presented some key ideas. In this post, I will describe how those ideas are leveraged in the Isomap algorithm. A clojure implementation based on core.matrix is also included.

In a previous post, I described the MDS (multidimensional scaling) algorithm. This algorithm operates on a proximity matrix which is a matrix of distances between the points in a dataset. From this matrix, a configuration of points is retrieved in a lower dimension.

The MDS strategy is:

- We have a matrix $ D $ for distances between points in the data. This matrix is
*symmetric*. - We express distances as dot-products (using a proof from Schonberg). This means that $ D $ is expressed as $ X^T X $. (Observe that $ X^T X $ is a matrix of dot-products).
- Once we have $ X^T X $, dimension reduction is trivial. Running an eigendecomposition on this matrix will produced centered coordinates. The low-dimension embedding is recovered by discarding eigenvalues (eigenvectors).

Thus, assuming that we work with *euclidean distances* between points, we retrieve an embedding that PCA itself would produce. Thus, **MDS with euclidean distances is identical to PCA**.

Then what exactly is the value of running MDS on a dataset?

First, the PCA is not the most powerful approach. For certain datasets, euclidean distances do not capture the shape of the underlying manifold. Running the steps of the MDS on a different distance matrix (at least one that doesn’t contain euclidean distances) can lead to better results - a technique that the Isomap algorithm exploits.

Second, the PCA requires a vector-representation for points. In several situations, the objects in the dataset are not points in a metric space (like strings). We can retrieve distances between objects (say edit-distance for strings) and then obtain a vector-representation for the objects using MDS.

In the next blog post, I will describe and implement the Isomap algorithm that leverages the ideas in the MDS strategy. Isomap constructs a distance matrix that attempts to do a better job at recovering the underlying manifold.

**PROOFS:**

*Distances are dot-products:*These notes from Peking U are easy to follow. I have mirrored them here in case that link 404s.

I spent the last few months studying and implementing some routines that take a raw HTML document (or documents) and do stuff with it (them). Subotai is a library that consolidates some of these routines. In this blog post I will describe what is currently implemented and what the roadmap is.

(c) Shriphani Palakodety 2013-2016