SHRIPHANI PALAKODETY: Posts tagged 'isomap'

Dimension Analysis: A Recap

Thu, 22 Jan 2015 03:31:29 UT

In the past few blog posts, I covered some details of popular dimension-reduction techniques and showed some common themes. In this post, I will collect all these materials and tie them together.

Dimension?

The best definition I’ve seen for the topic comes from Benoit Mandelbrot’s work on fractal geometry. The fractal dimension is associated with the ability of a pattern to fill space. Here’s a good example to illustrate what we mean.

Consider a curve viewed at three different scales (image stolen from Chris Burges’s document on dimension reduction):

Now, at a microscopic level, we begin observing. How do we observe? Assume that there’s a sphere around the observer. Now, let this sphere expand a bit. At the microscopic level, the sphere encounters more of the curve’s material in 2 dimensions. This is illustrated in the rightmost figure.

Now, at a slightly different scale, when our sphere expands, we observe more material along just 1 dimension. This is illustrated in the middle figure.

On a scale like the one in the leftmost pic, we encounter no material at all. This is akin to a zero-dimension figure (a point).

An intuitive explanation of why scale matters is provided in this Wikipedia example. Using rulers of different lengths, we obtain different measures for the coastline of Great Britain. At various levels of scale, we acquire various measures of the coastline - using a ruler that is as long as the diameter of the Earth, the coastline of Britain is a negligible fraction of our instrument.

There’s a neat formula that can be used to estimate the fractal dimension of a dataset:

$ n $: The number of pairs of points in our data.
$ r $: The radius of a sphere centered around the observer.
$ p $: The number of pairs of points in a sphere of radius $ r $.

The estimate of the fractal dimension is given by the slope of $ \log(p) $ vs $ \log(r) $.

For the curve in the example above, this value is some real number between 1 and 2 (so the points on the curve have more freedom than those on a line but less freedom than those on a 2D-plane).
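As a rough sketch of this estimate in plain Clojure: the helper names `euclidean`, `pairs-within`, `slope` and `correlation-dimension` are made up for illustration, and the choice of radii is left to the caller. It counts the pairs within each radius and fits a least-squares line through the $ \log(p) $ vs $ \log(r) $ points.

```clojure
;; Sketch only: all function names here are hypothetical helpers,
;; not part of any library.

(defn euclidean
  "Euclidean distance between two equal-length numeric vectors."
  [a b]
  (Math/sqrt
   (double (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b)))))

(defn pairs-within
  "p: the number of unordered pairs of points at distance <= r."
  [points r]
  (let [pts (vec points)
        n   (count pts)]
    (count (for [i (range n)
                 j (range (inc i) n)
                 :when (<= (euclidean (pts i) (pts j)) r)]
             [i j]))))

(defn slope
  "Least-squares slope of ys against xs."
  [xs ys]
  (let [n   (double (count xs))
        mx  (/ (reduce + xs) n)
        my  (/ (reduce + ys) n)
        sxy (reduce + (map (fn [x y] (* (- x mx) (- y my))) xs ys))
        sxx (reduce + (map (fn [x] (* (- x mx) (- x mx))) xs))]
    (/ sxy sxx)))

(defn correlation-dimension
  "Slope of log(p) vs log(r) over the given radii.
   Every radius must capture at least one pair, otherwise log(0) appears."
  [points radii]
  (slope (map (fn [r] (Math/log (double r))) radii)
         (map (fn [r] (Math/log (double (pairs-within points r)))) radii)))
```

For 50 evenly spaced points on a line, `(correlation-dimension (for [i (range 50)] [i 0]) [1 2 4 8])` should come out close to 1.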

A First Stab at Dimension Reduction

Working with intuitions we developed in the first section, we can develop a greedy algorithm:

Estimate the fractal-dimension of the dataset.
Choose a dimension (column) to drop, drop it and recompute the fractal-dimension. If the dimension doesn’t change too much (stays within a certain tolerance), consider this dimension dropped.
Repeat until no more dimensions can be dropped without significantly altering the fractal-dimension.

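The fractal-dimension estimate used at each step (the slope of $ \log(p) $ vs $ \log(r) $) is the Grassberger-Procaccia algorithm; the greedy loop simply re-runs it after each candidate drop.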

It is intuitive to grasp.
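The greedy loop can be sketched like this. The names `drop-column`, `greedy-drop` and `varying-columns` are hypothetical, and `estimate-dim` stands for whatever dimension estimator is plugged in. To keep the example self-contained, `varying-columns` is only a toy stand-in that counts non-constant columns; it is not a fractal-dimension estimate.

```clojure
;; Sketch only: all names below are hypothetical helpers.

(defn drop-column
  "Removes column idx from every point (points are vectors)."
  [points idx]
  (mapv (fn [p] (vec (concat (subvec p 0 idx) (subvec p (inc idx)))))
        points))

(defn greedy-drop
  "Repeatedly drops the first column whose removal keeps the estimate
   within tol of the estimate for the full data. Returns the indices
   of the original columns that survive."
  [points estimate-dim tol]
  (let [target (estimate-dim points)]
    (loop [pts  (mapv vec points)
           kept (vec (range (count (first points))))]
      (let [droppable
            (when (> (count kept) 1)   ; never drop the last column
              (first
               (filter
                (fn [i]
                  (<= (Math/abs
                       (double (- (estimate-dim (drop-column pts i))
                                  target)))
                      tol))
                (range (count kept)))))]
        (if droppable
          (recur (drop-column pts droppable)
                 (vec (concat (subvec kept 0 droppable)
                              (subvec kept (inc droppable)))))
          kept)))))

;; Toy stand-in estimator: the number of columns that are not constant.
(defn varying-columns
  [points]
  (count (filter (fn [i] (> (count (distinct (map #(nth % i) points))) 1))
                 (range (count (first points))))))

;; Column 1 is constant, so it is the only one dropped.
(greedy-drop [[0 5 0] [1 5 2] [2 5 4]] varying-columns 0.1)
```

Comparing against the estimate for the full data (rather than the previous iteration) is a design choice made here so that small changes cannot accumulate across drops.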

However, in a high-dimension setting, our technique for estimating fractal dimensions falls apart.

In a high-dimensional setting, pairwise distances between the points in a dataset are tightly clustered about a mean. Essentially, the points seem to be equidistant from each other. A Hoeffding bound is provided in this blog post that illustrates this point.

The PCA

One common attempt at reducing dimensions is capturing directions of maximum variance. The PCA projects points in the dataset onto the eigenvectors of the covariance matrix that have the largest eigenvalues. Since this technique is well-known, I’ll just point to this Wikipedia article.
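A sketch in symbols, with notation assumed here rather than taken from the linked article: let $ X $ be the $ d \times n $ matrix whose columns are the $ n $ points with the mean point subtracted. Then:

$$
C = \frac{1}{n-1} X X^T, \qquad C = V \Lambda V^T, \qquad Y = V_k^T X
$$

The columns of $ V_k $ are the eigenvectors of $ C $ with the $ k $ largest eigenvalues, and the columns of $ Y $ are the points in $ k $ dimensions.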

From Proximities to Datasets

A family of techniques I like a lot operates on proximity matrices. A proximity matrix is a symmetric matrix containing similarity scores between the points in a dataset (thus this matrix contains $ n $ rows and $ n $ columns where $ n $ is the number of points in the dataset).

A simple argument demonstrates that proximity matrices are gram matrices (a gram matrix is a close cousin of the covariance matrix). One can retrieve a collection of points for a given gram matrix - see this blog post for a proof.

This family of techniques formulates the dimension-reduction problem as follows: “Find a configuration of points in a lower-dimensional space that preserves the proximities in the proximity matrix”.

The standard MDS algorithm uses euclidean distances between points to populate the proximity matrix. This blog post contains more info about this algorithm.

A variant of this algorithm uses path-weights in a $k$-NN graph. This is the Isomap algorithm - covered in this post.

The Kernel Trick

The Kernel trick is leveraged in settings where we transform our points to a higher-dimensional space to make the desired insight pop out. When working with a classifier, the desired insight is a hyperplane that separates two different classes. In Kernel PCA, the desired insight is the directions of maximum variance, so we run a PCA on the transformed dataset in the higher-dimensional space.
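A small sketch of why this is called a trick: for some transformations, the dot product in the higher-dimensional space can be computed directly from the original points. The example below uses the degree-2 polynomial kernel on 2D points, chosen here purely for illustration; `dot`, `poly-kernel` and `phi` are hypothetical names.

```clojure
;; Sketch only: hypothetical helper names, 2D inputs assumed.

(defn dot
  "Dot product of two equal-length numeric vectors."
  [a b]
  (reduce + (map * a b)))

(defn poly-kernel
  "k(a, b) = (a . b)^2, computed in the original 2D space."
  [a b]
  (let [d (dot a b)]
    (* d d)))

(defn phi
  "Explicit map to 3D whose dot product equals poly-kernel."
  [[x1 x2]]
  [(* x1 x1) (* (Math/sqrt 2.0) x1 x2) (* x2 x2)])

;; Both expressions agree up to floating-point rounding:
(poly-kernel [1 2] [3 4])
(dot (phi [1 2]) (phi [3 4]))
```

The kernel never builds the 3D vectors, which is what makes much larger (even infinite-dimensional) feature spaces workable.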

Interestingly, MDS and Isomap are both variants of the Kernel PCA - a topic explored in this blog post.

Up Next

In future blog posts, I will discuss scaling issues with spectral algorithms, insights that can be transferred to other domains and so on.

The Isomap Algorithm

Wed, 12 Nov 2014 21:24:54 UT

This is part of a series on a family of dimension-reduction algorithms called non-linear dimension reduction. The goal here is to reduce the dimensions of a dataset (i.e. discard some columns in your data).

In previous posts, I discussed the MDS algorithm and presented some key ideas. In this post, I will describe how those ideas are leveraged in the Isomap algorithm. A Clojure implementation based on core.matrix is also included.

Intuition

Isomap uses the same core ideas as the MDS algorithm:

Obtain a matrix of proximities (distances between points in a dataset).
This distance matrix can be converted into a matrix of inner products (by double-centering the squared distances).
An eigendecomposition of this matrix gives us the lower dimension embedding.

Isomap differs from MDS in one vital way - the construction of the distance matrix. In MDS, the distance between two points is just the euclidean distance.

In Isomap, the distance between two points is the weight of the shortest path between them in a point graph.

The point graph is constructed by placing an edge between two points if the euclidean distance between them falls under a certain threshold, or alternatively by placing an edge between each point and its $ k $ nearest neighbors.
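The threshold variant can be sketched in a few lines of plain Clojure. The names `euclidean` and `epsilon-graph` and the parameter `eps` are made up for this illustration.

```clojure
;; Sketch only: hypothetical helper names.

(defn euclidean
  "Euclidean distance between two equal-length numeric vectors."
  [a b]
  (Math/sqrt
   (double (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b)))))

(defn epsilon-graph
  "Maps each point index to the indices of all other points
   that lie within distance eps of it."
  [points eps]
  (let [pts (vec points)
        n   (count pts)]
    (into {}
          (for [i (range n)]
            [i (vec (for [j (range n)
                          :when (and (not= i j)
                                     (<= (euclidean (pts i) (pts j)) eps))]
                      j))]))))

;; Point 2 is too far from the others to get any edges.
(epsilon-graph [[0 0] [1 0] [5 0]] 1.5)
```

Unlike a $ k $-NN graph, this graph is symmetric by construction, but isolated points can end up with no edges at all. Shortest-path weights in either kind of graph then fill the distance matrix.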

This distance matrix captures the underlying manifold more accurately than one constructed using euclidean distances. The following toy example demonstrates this:

The data shown here looks like a swirl that starts at point 1 and ends at point 9. We would like to recover this phenomenon in our lower-dimension embedding.

The first step is to build a distance matrix. Say we use euclidean distances between two points as the corresponding entry in the distance matrix.

In this figure, it is clear that euclidean_distance(1, 6) = euclidean_distance(1, 8) and euclidean_distance(1, 5) = euclidean_distance(1, 9).

Clearly the distances computed here miss the “swirl” in the data entirely. Working with the point graph mentioned above helps us get around this problem.

Let us build a point graph by adding an edge between a node and its nearest neighbor (so $ 1 $-NN). The weight on the edge is the euclidean distance between the nodes. The distance between two points is the weight of the shortest path between these points. The point graph is shown below:

Observe that when we use this newer distance metric, distance(1, 6) is indeed less than distance(1, 8) and distance(1, 5) is indeed less than distance(1, 9).

This distance function is clearly doing a better job of capturing the “swirl” in the data.
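The same effect can be reproduced numerically on a made-up dataset (not the one in the figures): seven points laid out in a "C" shape. For simplicity this sketch joins each point only to the next one in the list, so the path distance is just the length of the chain. `c-shape`, `euclidean` and `chain-distance` are hypothetical names.

```clojure
;; Sketch only: made-up data and hypothetical helper names.

(def c-shape [[0 0] [1 0] [2 0] [2 1] [2 2] [1 2] [0 2]])

(defn euclidean
  "Euclidean distance between two equal-length numeric vectors."
  [a b]
  (Math/sqrt
   (double (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b)))))

(defn chain-distance
  "Path length from point i to point j (i <= j) when every point
   is joined only to the next point in the vector."
  [points i j]
  (reduce + (map euclidean
                 (subvec points i j)
                 (subvec points (inc i) (inc j)))))

;; Straight-line distance says the far end of the C (index 6)
;; is closer to the start than the corner-side point (index 3) ...
(< (euclidean (c-shape 0) (c-shape 6))
   (euclidean (c-shape 0) (c-shape 3)))

;; ... while the path distance puts index 6 twice as far away.
(chain-distance c-shape 0 6)
(chain-distance c-shape 0 3)
```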

The Isomap algorithm uses a distance matrix constructed like this in place of one constructed with euclidean distances. This distance matrix is then plugged into the MDS framework and an eigendecomposition is run on the double-centered matrix.
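For reference, double-centering can be sketched on plain nested vectors as below. This is an illustration of the step under the usual classical-MDS definition, not the code from the MDS post; `mean` and `double-center` are hypothetical names.

```clojure
;; Sketch only: hypothetical helper names, plain nested vectors.

(defn mean
  [xs]
  (/ (reduce + xs) (double (count xs))))

(defn double-center
  "Turns a symmetric distance matrix into a matrix of inner products:
   b_ij = -1/2 (d_ij^2 - row-mean_i - col-mean_j + grand-mean),
   where the means are taken over the squared distances."
  [distances]
  (let [sq        (mapv (fn [row] (mapv #(* % %) row)) distances)
        row-means (mapv mean sq)
        col-means (mapv mean (apply map vector sq))
        total     (mean row-means)]
    (vec (map-indexed
          (fn [i row]
            (vec (map-indexed
                  (fn [j d2]
                    (* -0.5 (+ (- d2 (row-means i) (col-means j)) total)))
                  row)))
          sq))))

;; Distances between the points 0, 1 and 3 on a line.
(double-center [[0 1 3] [1 0 2] [3 2 0]])
```

For this input the result matches the dot products of the centered coordinates $ -4/3, -1/3, 5/3 $, and every row sums to zero.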

Implementation

Let us do a Clojure implementation.

We have a point-set as a core.matrix matrix. First, we compute the point-graph. I am going to place edges between a point and its 3 nearest neighbors (so $ 3 $-NN). This routine expects a collection of [point-index point-vector] pairs, such as a map of the type {point-index point-vector, …}

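;; Assumes distance, zero-matrix, mget and mset are referred from
;; clojure.core.matrix, and + from clojure.core.matrix.operators.
(defn build-point-graph
  "A point graph is a k-NN graph: edges between
   a point and its k nearest neighbors (3 by default)"
  ([indexed-points]
     (build-point-graph indexed-points 3))

  ([indexed-points num-neighbors]
     (reduce
      (fn [acc pt]
        (let [other-points (filter
                            (fn [x]
                              (not= (first x)
                                    (first pt)))
                            indexed-points)]
          (merge
           acc
           {(first pt) (map
                        first
                        (take num-neighbors
                              ;; compare the point vectors (second),
                              ;; not the indices (first)
                              (sort-by
                               #(distance (second pt)
                                          (second %))
                               other-points)))})))
      {}
      indexed-points)))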

Next comes a simple Floyd-Warshall implementation that computes the weights of the shortest paths. It takes the graph built in the previous step and the original indexed points, and builds the distance matrix.

(defn floyd-warshall-distance
  "Expected graph representation:
    {V -> neighboring-points}"
  [a-graph indexed-points]
  (let [indexed-points-dict (into {} indexed-points)
        ;; k-NN is not symmetric, so add each edge in both directions;
        ;; MDS expects a symmetric distance matrix.
        edges    (mapcat
                  (fn [[x neighbors]]
                    (mapcat (fn [n] [[x n] [n x]])
                            neighbors))
                  a-graph)

        inf-matrix (+ Double/POSITIVE_INFINITY
                      (zero-matrix (count indexed-points)
                                   (count indexed-points)))

        zero-diag  (reduce
                    (fn [acc i]
                      (mset acc i i 0))
                    inf-matrix
                    (-> indexed-points count range))

        weights-init (reduce
                      (fn [acc [x y]]
                        (mset acc
                              x
                              y
                              (distance
                               (indexed-points-dict x)
                               (indexed-points-dict y))))
                      zero-diag
                      edges)]

    (reduce
     (fn [old-distances [k i j]]
       (if (< (+ (mget old-distances i k)
                 (mget old-distances k j))
              (mget old-distances i j))
         (mset old-distances
               i
               j
               (+ (mget old-distances i k)
                  (mget old-distances k j)))
         old-distances))
     weights-init
     (for [k (-> indexed-points count range)
           i (-> indexed-points count range)
           j (-> indexed-points count range)]
       [k i j]))))

Once we have a distance matrix, we can simply feed it to MDS:

(defn isomap
  "Takes a sequence of points and the target dimension n"
  [points n]
  (let [indexed-points (map-indexed (fn [i x] [i x]) points)
        graph (build-point-graph indexed-points)
        distances (floyd-warshall-distance graph indexed-points)]
    (mds/distances->points distances n)))

And that’s it!

Examples

I will use word-vectors from word2vec for these 10 words: river lake city town actor doctor dog cat animal home

The word vectors for these words are available in foo.csv.

Let us reduce these to two dimensions. We get:

ISOMAP Embeddings

The embeddings produced by the MDS algorithm are:

MDS Embeddings

Compared to the plot produced by MDS, we have more separation between terms - for instance, cat and dog are placed close together but they don’t overlap (unlike in the MDS plot). This is a qualitative analysis; it is pretty hard to gauge which embedding is better.

Full Source

See this repo: https://github.com/shriphani/clojure-manifold

Powerful Ideas in Manifold Learning

Sun, 02 Nov 2014 10:10:47 UT

In a previous post, I described the MDS (multidimensional scaling) algorithm. This algorithm operates on a proximity matrix which is a matrix of distances between the points in a dataset. From this matrix, a configuration of points is retrieved in a lower dimension.

The MDS strategy is:

We have a matrix $ D $ for distances between points in the data. This matrix is symmetric.
We express distances as dot-products (using a proof from Schoenberg). This means that $ D $, once its squared entries are double-centered, can be written as $ X^T X $. (Observe that $ X^T X $ is a matrix of dot-products).
Once we have $ X^T X $, dimension reduction is trivial. Running an eigendecomposition on this matrix will produce centered coordinates. The low-dimension embedding is recovered by discarding the smallest eigenvalues (and their eigenvectors).
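The steps above can be written out as follows. This is a sketch of the standard classical-MDS recipe; the symbols $ J $, $ B $, $ D^{(2)} $, $ V_k $ and $ \Lambda_k $ are introduced here for illustration, with $ D^{(2)} $ the matrix of squared distances and $ n $ the number of points:

$$
J = I - \frac{1}{n} \mathbf{1}\mathbf{1}^T, \qquad B = -\frac{1}{2} J D^{(2)} J = X^T X
$$

$$
B = V \Lambda V^T, \qquad X_k = \Lambda_k^{1/2} V_k^T
$$

Here the columns of $ X $ are the centered points. Keeping only the $ k $ largest eigenvalues in $ \Lambda_k $ (and the matching eigenvectors in $ V_k $) gives the $ k $-dimensional coordinates as the columns of $ X_k $.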

Thus, assuming that we work with euclidean distances between points, we retrieve the embedding that PCA itself would produce. In other words, MDS with euclidean distances is equivalent to PCA.

Then what exactly is the value of running MDS on a dataset?

First, the PCA is not the most powerful approach. For certain datasets, euclidean distances do not capture the shape of the underlying manifold. Running the steps of the MDS on a different distance matrix (at least one that doesn’t contain euclidean distances) can lead to better results - a technique that the Isomap algorithm exploits.

Second, the PCA requires a vector-representation for points. In several situations, the objects in the dataset are not points in a vector space (like strings). We can retrieve distances between objects (say edit-distance for strings) and then obtain a vector-representation for the objects using MDS.
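As a sketch of the string case, the distance matrix can be built with a Levenshtein edit distance. `edit-distance` and `distance-matrix` are hypothetical names, and the three words are made-up sample data.

```clojure
;; Sketch only: hypothetical helper names and sample data.

(defn edit-distance
  "Levenshtein distance between strings a and b, computed one row
   of the dynamic-programming table at a time."
  [a b]
  (let [b-chars (vec b)]
    (last
     (reduce
      (fn [prev-row [i ca]]
        (reduce
         (fn [row [j cb]]
           (conj row
                 (min (inc (prev-row (inc j)))                  ; deletion
                      (inc (peek row))                          ; insertion
                      (+ (prev-row j) (if (= ca cb) 0 1)))))    ; substitution
         [(inc i)]
         (map-indexed vector b-chars)))
      (vec (range (inc (count b-chars))))
      (map-indexed vector a)))))

(defn distance-matrix
  "Symmetric matrix of pairwise edit distances."
  [words]
  (mapv (fn [a] (mapv (fn [b] (edit-distance a b)) words)) words))

(distance-matrix ["cat" "cut" "cute"])
;; => [[0 1 2] [1 0 1] [2 1 0]]
```

A matrix like this can be handed to MDS in place of euclidean distances, even though the strings themselves have no coordinates.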

In the next blog post, I will describe and implement the Isomap algorithm that leverages the ideas in the MDS strategy. Isomap constructs a distance matrix that attempts to do a better job at recovering the underlying manifold.

PROOFS:

Distances are dot-products: These notes from Peking U are easy to follow. I have mirrored them here in case that link 404s.