Switching Gigs
Kimono Labs and I have parted ways; I will be working on problems at a stealth startup starting Monday.
Nutch and Heritrix are battle-tested web crawlers. The ClueWeb09, ClueWeb12 and Common Crawl corpora were each built with one of them.
Toy crawlers that hold important data-structures in memory fail spectacularly when downloading a large number of pages. Heritrix and Nutch benefit from several man-years of work aimed at stability and scalability.
In a previous project, I wanted to leverage Heritrix’s infrastructure while keeping the flexibility to implement some custom components in Clojure: for instance, extracting only certain links based on the output of a classifier, or using simple enlive selectors.
The solution I used was to expose the routines I wanted via a web-server and have Heritrix request these routines.
This allowed me to use libraries like enlive that I am comfortable with and still benefit from the infrastructure Heritrix provides.
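The general pattern (custom routines behind a small web server that the crawler calls) can be sketched as follows. This is an illustration only, written in Python with the standard library rather than Clojure and enlive, and it is not how sleipnir itself is implemented. The endpoint shape, the port, and the names `extract_links`, `ExtractHandler` and `serve` are all hypothetical; the `keep` predicate stands in for a classifier.

```python
import json
from html.parser import HTMLParser
from http.server import BaseHTTPRequestHandler, HTTPServer


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html, keep=lambda href: True):
    """Return the links in `html` accepted by `keep` (a stand-in for a classifier)."""
    collector = LinkCollector()
    collector.feed(html)
    return [href for href in collector.links if keep(href)]


class ExtractHandler(BaseHTTPRequestHandler):
    """Hypothetical endpoint: POST a page body, receive a JSON list of links."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        html = self.rfile.read(length).decode("utf-8", errors="replace")
        body = json.dumps(extract_links(html)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Run the extraction service on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), ExtractHandler).serve_forever()
```

Calling `serve()` starts the service; the crawler side then only needs to POST each fetched page and read back the list of links to follow.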
What follows is a library, sleipnir, that allows you to do all this in a simple way.
In the past few blog posts, I covered some details of popular dimension-reduction techniques and showed some common themes. In this post, I will collect all these materials and tie them together.
The Kernel PCA is an extension of the PCA algorithm. In particular, we desire to (i) transform our existing dataset to another high-dimensional space and then (ii) perform PCA on the data in that space.
In this blog post, I will perform a very quick, non-rigorous overview of Kernel PCA and demonstrate some connections between other forms of dimension-reduction.
Why would this be useful? Consider this dataset (stolen from the scikit-learn documentation):
As we can see, the original PCA projection is fairly useless. Applying a kernel produces a much better projection.
As with kernels in several other problems, the trick is to avoid transforming the data points explicitly and to work with dot products instead.
The technique is as follows:
$ \Phi $ is a mapping from our existing point-space to a higher-dimensional space $ \mathcal{F} $.
After a certain amount of linear algebra, the PCA in space $ \mathcal{F} $ can be expressed as a PCA on the kernel matrix.
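A sketch of that linear algebra, assuming the mapped points are already centered in $ \mathcal{F} $. The symbols $ C $, $ v $, $ \lambda $, $ \alpha $ and $ n $ (the number of points) are introduced here for illustration, and $ K_{ij} = \Phi(x_i) \cdot \Phi(x_j) $. PCA in $ \mathcal{F} $ asks for the eigenvectors of the covariance matrix:

$$ C = \frac{1}{n} \sum_{i=1}^{n} \Phi(x_i) \Phi(x_i)^{T}, \qquad C v = \lambda v $$

Any eigenvector with $ \lambda \neq 0 $ lies in the span of the mapped points, so it can be written as:

$$ v = \sum_{j=1}^{n} \alpha_j \Phi(x_j) $$

Substituting this into $ C v = \lambda v $ and taking the dot product of both sides with each $ \Phi(x_k) $ gives:

$$ \frac{1}{n} K^{2} \alpha = \lambda K \alpha $$

which is satisfied by the solutions of the ordinary eigenproblem on the kernel matrix:

$$ K \alpha = n \lambda \alpha $$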
So, the algorithm is expressible as follows:
Compute the kernel matrix $ K $ where $ K_{ij} = \Phi(x_i) \cdot \Phi(x_j) $.
This matrix is efficiently constructed since we can obtain the constituent dot products in the original space.
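The SVD of $ K $ gives you $ USV^{T} $; $ U $ and $ S $ can be used to construct a reduced-dimension dataset (for example, the leading columns of $ U $ scaled by the square roots of the corresponding entries of $ S $).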
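The steps above can be sketched in Python with NumPy. This is a minimal sketch under assumptions the post does not state: an RBF kernel with a hypothetical `gamma` parameter, explicit centering of the kernel matrix, and the illustrative function names `rbf_kernel_matrix` and `kernel_pca`.

```python
import numpy as np


def rbf_kernel_matrix(X, gamma):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2), computed in the original space."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def kernel_pca(X, n_components, gamma=1.0):
    """Embed the rows of X using the top eigenpairs of the centered kernel matrix."""
    n = X.shape[0]
    K = rbf_kernel_matrix(X, gamma)
    # Centering K is equivalent to subtracting the mean of the mapped points.
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J
    # Kc is symmetric positive semi-definite, so its SVD and its
    # eigendecomposition coincide; eigh returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.maximum(eigvals[order], 0.0)  # clip tiny negative round-off
    # Coordinates of each point along the leading components: U * sqrt(S).
    return eigvecs[:, order] * np.sqrt(top_vals)


# Two concentric circles, similar in spirit to the dataset discussed above.
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, size=200)
radii = np.repeat([1.0, 3.0], 100)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
Z = kernel_pca(X, n_components=2, gamma=0.5)
```

Note that the points are never mapped through $ \Phi $ explicitly; only the kernel values between pairs of points are computed.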
Now that the intro is out of the way, I wanted to demonstrate some simple connections between algorithms I’ve covered recently:
In previous blog posts, we covered that MDS and PCA are equivalent. A simple proof exists to show that MDS and Kernel PCA are the same thing:
The Isomap algorithm (covered in a previous post) replaces Euclidean distances with shortest-path distances over the edge weights of a nearest-neighbor graph. The entries in this proximity matrix are surrogates for distances, and thus the Isomap algorithm is an instance of Kernel PCA as well.
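To make the connection concrete, here is a small Python/NumPy sketch of Isomap written so that its last step is the same decomposition used in Kernel PCA. It is not the Clojure implementation from the earlier post; the function name `isomap`, its parameters, and the choice of Floyd-Warshall for shortest paths are assumptions made for illustration, and the sketch assumes no duplicate points.

```python
import numpy as np


def isomap(X, n_neighbors, n_components):
    """Embed the rows of X using graph shortest-path distances in place of Euclidean ones."""
    n = X.shape[0]
    euclid = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

    # Nearest-neighbor graph: edge weight = Euclidean distance, no edge = infinity.
    graph = np.full((n, n), np.inf)
    np.fill_diagonal(graph, 0.0)
    for i in range(n):
        nearest = np.argsort(euclid[i])[1:n_neighbors + 1]  # index 0 is the point itself
        graph[i, nearest] = euclid[i, nearest]
        graph[nearest, i] = euclid[i, nearest]               # keep the graph symmetric

    # Floyd-Warshall all-pairs shortest paths: O(n^3), fine for a small sketch.
    for k in range(n):
        graph = np.minimum(graph, graph[:, [k]] + graph[[k], :])
    if not np.all(np.isfinite(graph)):
        raise ValueError("neighbor graph is disconnected; try a larger n_neighbors")

    # Treat the doubly centered squared distances as a kernel matrix and decompose it.
    J = np.eye(n) - np.ones((n, n)) / n
    K = -0.5 * J @ (graph ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(K)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))


# Five evenly spaced points on a line: graph distances equal Euclidean ones here.
line = np.arange(5, dtype=float).reshape(-1, 1)
Z = isomap(line, n_neighbors=2, n_components=1)
```

The only Isomap-specific work is building the distance matrix; everything after the centering step is shared with the other methods in this post.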
It is amazing how several different approaches to dimension-reduction are variants of a single theme.
There is a very simple argument that shows that MDS and PCA achieve the same results.
This argument has 2 important components. The first of these shows that an eigendecomposition of a gram matrix can be used for dimension-reduction.
PCA leverages the singular value decomposition (SVD). Given a matrix $ X $, the SVD is $ X = USV^{T} $.
We drop columns from $ X $ by using $ US_{t} $, where $ S_{t} $ is $ S $ with some rows and columns dropped.
This is also conveniently obtained using an eigendecomposition of the covariance matrix.
Working with the gram matrix, we have $ XX^{T} $ and when expressed in terms of $ U $, $ S $ and $ V $, we have $ XX^{T} $ = $ (USV^{T})(VSU^{T}) $.
Simple algebra (using $ V^{T}V = I $) tells us that this is equal to $ US^{2}U^{T} $. The spectral theorem tells us that an eigendecomposition of the gram matrix will return this decomposition. $ U $ and $ S $ can be retrieved ($ S $ as the square roots of the eigenvalues) and a dataset with fewer dimensions can be obtained.
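A quick numerical check of this first component, as a sketch in Python with NumPy. The random data, the variable names and the choice of `t = 2` retained dimensions are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
X = X - X.mean(axis=0)           # center the columns, as PCA assumes

# Route 1: SVD of X, keep the top t singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
t = 2
reduced_svd = U[:, :t] * s[:t]   # U S_t

# Route 2: eigendecomposition of the gram matrix X X^T = U S^2 U^T.
eigvals, eigvecs = np.linalg.eigh(X @ X.T)
order = np.argsort(eigvals)[::-1][:t]
reduced_gram = eigvecs[:, order] * np.sqrt(eigvals[order])

# Eigenvectors are only defined up to sign, so compare absolute values.
same = np.allclose(np.abs(reduced_svd), np.abs(reduced_gram))
```

Both routes produce the same reduced dataset up to the sign of each column, which is all the argument needs.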
The second part of the argument involves proving that a matrix of distances is indeed a gram matrix. This argument was discussed in a previous post.
Last week, I dropped out of the PhD program at LTI, SCS, Carnegie Mellon University with a masters degree. The specifics of this decision will remain private. If you are going through issues in your PhD program, I will be happy to talk to you and give you advice (should you desire any). You can email me at: shriphanip@gmail.com. You can also call me at 425–623–2604.
Eigendecompositions and Singular Value Decompositions appear in a variety of settings in machine learning and data mining. The eigendecomposition looks like so:
$ \mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{-1} $
$ \mathbf{Q} $ contains the eigenvectors of $ \mathbf{A} $ and $ \mathbf{\Lambda} $ is a diagonal matrix containing the eigenvalues.
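The singular value decomposition looks like:
$ \mathbf{A} = \mathbf{U}\mathbf{S}\mathbf{V^T} $
$ \mathbf{S} $ is a diagonal matrix containing the singular values. $ \mathbf{U} $ contains the eigenvectors of $ \mathbf{A}\mathbf{A^T} $ (the covariance matrix, up to centering and scaling, when the columns of $ \mathbf{A} $ are the data points). $ \mathbf{V} $ contains the eigenvectors of the gram matrix $ \mathbf{A^T}\mathbf{A} $.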
The truncated variants of these decompositions allow us to compute only a few eigenvalues (and eigenvectors) or singular values (and singular vectors).
This is important since (i) a lot of times, the smaller eigenvalues are discarded, and (ii) you don’t want to compute the entire decomposition and retain only a few of the rows and columns of the computed matrices each time.
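One simple way to compute only the leading eigenpairs is power iteration with deflation, sketched below in Python with NumPy. This illustrates the idea of a truncated decomposition and is not a description of how Kublai computes it. The function name `top_k_eigs` and its parameters are hypothetical, and the sketch assumes a symmetric matrix whose leading eigenvalues are well separated in magnitude.

```python
import numpy as np


def top_k_eigs(A, k, iters=500, seed=0):
    """Leading k eigenpairs (by magnitude) of a symmetric matrix A."""
    rng = np.random.default_rng(seed)
    A = np.array(A, dtype=float)          # work on a copy; it is deflated below
    values, vectors = [], []
    for _ in range(k):
        v = rng.normal(size=A.shape[0])
        v /= np.linalg.norm(v)
        for _ in range(iters):
            w = A @ v
            norm = np.linalg.norm(w)
            if norm == 0.0:               # nothing left after deflation
                break
            v = w / norm                  # repeated multiplication favors the dominant direction
        lam = v @ A @ v                   # Rayleigh quotient gives the eigenvalue
        values.append(lam)
        vectors.append(v)
        A = A - lam * np.outer(v, v)      # deflate: remove the component just found
    return np.array(values), np.column_stack(vectors)


M = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.0],
              [0.0, 0.0, 1.0]])
vals, vecs = top_k_eigs(M, k=2)
```

Only `k` matrix-vector iterations are run, so the remaining eigenpairs are never computed, which is the saving a truncated decomposition is after.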
For core.matrix, I implemented these truncated decompositions in Kublai. Details below.
This is part of a series on a family of dimension-reduction algorithms called non-linear dimension reduction. The goal here is to reduce the dimensions of a dataset (i.e. discard some columns in your data).
In previous posts, I discussed the MDS algorithm and presented some key ideas. In this post, I will describe how those ideas are leveraged in the Isomap algorithm. A Clojure implementation based on core.matrix is also included.
In a previous post, I described the MDS (multidimensional scaling) algorithm. This algorithm operates on a proximity matrix which is a matrix of distances between the points in a dataset. From this matrix, a configuration of points is retrieved in a lower dimension.
The MDS strategy is:
Thus, assuming that we work with Euclidean distances between points, we retrieve the same embedding that PCA itself would produce. In other words, MDS with Euclidean distances is identical to PCA.
Then what exactly is the value of running MDS on a dataset?
First, PCA is not always the most powerful approach. For certain datasets, Euclidean distances do not capture the shape of the underlying manifold. Running the steps of MDS on a different distance matrix (one that doesn’t contain Euclidean distances) can lead to better results, a technique that the Isomap algorithm exploits.
Second, PCA requires a vector representation for points. In several situations, the objects in the dataset are not points in a vector space (strings, for example). We can still retrieve distances between objects (say, edit distance for strings) and then obtain a vector representation for the objects using MDS.
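The string case can be sketched as follows in Python with NumPy. This is a minimal illustration under assumptions: classical MDS via double centering of the squared distances, Levenshtein distance as the edit distance, and an arbitrary word list. The names `edit_distance` and `classical_mds` are hypothetical.

```python
import numpy as np


def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def classical_mds(D, n_components):
    """Recover coordinates from a distance matrix by double centering and eigendecomposition."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J           # gram matrix implied by the distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Non-Euclidean distances (edit distance included) can give negative
    # eigenvalues; clip them so the square root is defined.
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))


words = ["kitten", "sitting", "mitten", "fitting", "banana"]
D = np.array([[edit_distance(a, b) for b in words] for a in words], dtype=float)
coords = classical_mds(D, n_components=2)   # one 2-d vector per string
```

Each row of `coords` is a vector standing in for the corresponding string, which can then be fed to any method that expects points.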
In the next blog post, I will describe and implement the Isomap algorithm that leverages the ideas in the MDS strategy. Isomap constructs a distance matrix that attempts to do a better job at recovering the underlying manifold.
PROOFS:
Soli Deo Gloria
(c) Shriphani Palakodety 2013-2014