The Smallest Eigenvalues of a Graph Laplacian

Given a graph $ G = (V, E) $, its adjacency matrix $ A $ contains an entry at $ A_{ij} $ if vertices $ i $ and $ j $ have an edge between them. The degree matrix $ D $ contains the degree of each vertex along its diagonal.

The graph laplacian of $ G $ is given by $ D - A $.

Several popular techniques leverage the information contained in this matrix. This blog post focuses on the two smallest eigenvalues.

First, we look at the eigenvalue 0 and its eigenvectors. A very elegant result about its multiplicity forms the foundation of spectral clustering.

Then we look at the second smallest eigenvalue and the corresponding eigenvector. A slightly more involved result (YMMV) allows us to partition the graph in question. A recent publication by John Urschel (who apparently moonlights as a sportsperson) focused on this quantity.

The insights provided here sacrifice some rigor for the sake of brevity. I find such descriptions help me study without getting bogged down too much with details. A bibliography provided at the end contains links to actual proofs.

Multidimensional Scaling and PCA are the Same Thing

There is a very simple argument that shows that MDS and PCA achieve the same results.

This argument has 2 important components. The first of these shows that an eigendecomposition of a gram matrix can be used for dimension-reduction.

PCA leverages the singular value decomposition (SVD). Given a matrix $ X $, the SVD is $ X = USV^{T} $.

We drop columns from X by using $ US_{t} $ where we drop some rows and columns from S.

This is also conviently obtained using an eigendecomposition of the covariance matrix.

Working with the gram matrix, we have $ XX^{T} $ and when expressed in terms of $ U $, $ S $ and $ V $, we have $ XX^{T} $ = $ (USV^{T})(VSU^{T}) $.

Simple algebra tells us that this is equal to $ US^{2}U^{T} $. The spectral theorem tells us that this the eigendecomposition of the gram matrix will return this decomposition. $ U $ and $ S $ can be retrieved and a dataset with fewer dimensions can be obtained.

The second part of the argument involves proving that a matrix of distances is indeed a gram matrix. This argument was discussed in a previous post.

Implementing Truncated Matrix Decompositions for Core.Matrix

Eigendecompositions and Singular Value Decompositions appear in a variety of settings in machine learning and data mining. The eigendecomposition looks like so:

$$ \mathbf{A}=\mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{-1} $$

$ \mathbf{Q} $ contains the eigenvectors of $ \mathbf{A} $ and $ \mathbf{\Lambda} $ is a diagonal matrix containing the eigenvalues.

The singular value decomposition looks like:

$$ \mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^* $$

$ \mathbf{U} $ contains the eigenvectors of the covariance matrix $ \mathbf{A}\mathbf{A^T} $. $ \mathbf{V} $ contains the eigenvectors of the gram matrix $ \mathbf{A^T}\mathbf{A} $.

The truncated variants of these decompositions allow us to compute only a few eigenvalues(vectors) or singular values (vectors).

This is important since (i) a lot of times, the smaller eigenvalues are discarded, and (ii) you don’t want to compute the entire decomposition and retain only a few of the rows and columns of the computed matrices each time.

For core.matrix, I implemented these truncated decompositions in Kublai. Details below.

The Isomap Algorithm

This is part of a series on a family of dimension-reduction algorithms called non-linear dimension reduction. The goal here is to reduce the dimensions of a dataset (i.e. discard some columns in your data)

In previous posts, I discussed the MDS algorithm and presented some key ideas. In this post, I will describe how those ideas are leveraged in the Isomap algorithm. A clojure implementation based on core.matrix is also included.

Powerful Ideas in Manifold Learning

In a previous post, I described the MDS (multidimensional scaling) algorithm. This algorithm operates on a proximity matrix which is a matrix of distances between the points in a dataset. From this matrix, a configuration of points is retrieved in a lower dimension.

The MDS strategy is:

  • We have a matrix $ D $ for distances between points in the data. This matrix is symmetric.
  • We express distances as dot-products (using a proof from Schonberg). This means that $ D $ is expressed as $ X^T X $. (Observe that $ X^T X $ is a matrix of dot-products).
  • Once we have $ X^T X $, dimension reduction is trivial. Running an eigendecomposition on this matrix will produced centered coordinates. The low-dimension embedding is recovered by discarding eigenvalues (eigenvectors).

Thus, assuming that we work with euclidean distances between points, we retrieve an embedding that PCA itself would produce. Thus, MDS with euclidean distances is identical to PCA.

Then what exactly is the value of running MDS on a dataset?

First, the PCA is not the most powerful approach. For certain datasets, euclidean distances do not capture the shape of the underlying manifold. Running the steps of the MDS on a different distance matrix (at least one that doesn’t contain euclidean distances) can lead to better results - a technique that the Isomap algorithm exploits.

Second, the PCA requires a vector-representation for points. In several situations, the objects in the dataset are not points in a metric space (like strings). We can retrieve distances between objects (say edit-distance for strings) and then obtain a vector-representation for the objects using MDS.

In the next blog post, I will describe and implement the Isomap algorithm that leverages the ideas in the MDS strategy. Isomap constructs a distance matrix that attempts to do a better job at recovering the underlying manifold.


A Comment on Dimension-Estimation

I saw this neat comment in a paper I was recently reading. If you have all i.i.d features and you want to estimate its dimension using Grassberger-Procaccia (which estimates dimension using a distance-based metric) or want to classify using a k-NN classifier, it is bad if the data points are mostly pairwise equidistant (for instance, a correlation integral plot will look like a step function and thus will be useless; a k-NN classifier will break because the test point ends up equidistant from all the existing points).

There is a trivial argument using the Hoeffding bound in Chris Burges’ paper that suggests that if the features are all i.i.d, a majority of pairwise distances will end up clustered tightly around a mean which means that k-NN or Grassberger-Procaccia won’t work well. I am going to repeat this argument here so I can remember it for later:

Our vectors are of dimension $ d $ and the components are $ \pm1 $. Assuming all the components are $ iid $, the Hoeffding bound gives us:

$$ P(||| x_{1} - x_{2} ||^{2} – 2d| > d\epsilon) = P(| x_{1} \cdot x_{2} | > d\epsilon/2) \le 2exp(-\frac{d\epsilon^2}{8})$$

and this shows us that most pairwise distances will end up clustered very tightly around a mean and this means that a majority of pairs of points in the dataset will end up equidistant and thus a $ k-NN $ classifier will fail.

This also means that the correlation integral is a good way to determine if a k-NN classifier will work well. If the plot resembles a spike, the distance function needs to change.

The correlation-integral is an immensely powerful tool and here’s an implementation

Per Intellectum, Vis
(c) Shriphani Palakodety 2013-2016