Posts tagged 'data-mining'

Dimension Analysis: A Recap

2015-01-21

clojure, machine-learning, data-mining, dimension-reduction, mds, pca, kernel-pca, isomap

In the past few blog posts, I covered some details of popular dimension-reduction techniques and showed some common themes. In this post, I will collect all these materials and tie them together.

The Isomap Algorithm

2014-11-12

clojure, machine-learning, data-science, data-mining, dimension-reduction, isomap, mds, manifold-learning, manifolds, learning

This is part of a series on a family of dimension-reduction algorithms called non-linear dimension reduction. The goal here is to reduce the dimensions of a dataset (i.e. discard some columns in your data)

In previous posts, I discussed the MDS algorithm and presented some key ideas. In this post, I will describe how those ideas are leveraged in the Isomap algorithm. A clojure implementation based on core.matrix is also included.

Powerful Ideas in Manifold Learning

2014-11-02

machine-learning, visualization, data-mining, data-science, isomap, manifold-learning, dimension-reduction, mds, multidimensional-scaling, pca

In a previous post, I described the MDS (multidimensional scaling) algorithm. This algorithm operates on a proximity matrix which is a matrix of distances between the points in a dataset. From this matrix, a configuration of points is retrieved in a lower dimension.

The MDS strategy is:

We have a matrix $D$ for distances between points in the data. This matrix is symmetric.
We express distances as dot-products (using a proof from Schonberg). This means that $D$ is expressed as $X^T X$ . (Observe that $X^T X$ is a matrix of dot-products).
Once we have $X^T X$ , dimension reduction is trivial. Running an eigendecomposition on this matrix will produced centered coordinates. The low-dimension embedding is recovered by discarding eigenvalues (eigenvectors).

Thus, assuming that we work with euclidean distances between points, we retrieve an embedding that PCA itself would produce. Thus, MDS with euclidean distances is identical to PCA.

Then what exactly is the value of running MDS on a dataset?

First, the PCA is not the most powerful approach. For certain datasets, euclidean distances do not capture the shape of the underlying manifold. Running the steps of the MDS on a different distance matrix (at least one that doesn’t contain euclidean distances) can lead to better results - a technique that the Isomap algorithm exploits.

Second, the PCA requires a vector-representation for points. In several situations, the objects in the dataset are not points in a metric space (like strings). We can retrieve distances between objects (say edit-distance for strings) and then obtain a vector-representation for the objects using MDS.

In the next blog post, I will describe and implement the Isomap algorithm that leverages the ideas in the MDS strategy. Isomap constructs a distance matrix that attempts to do a better job at recovering the underlying manifold.

PROOFS:

Distances are dot-products: These notes from Peking U are easy to follow. I have mirrored them here in case that link 404s.

Subotai: Data Mining for HTML Documents

2014-06-24

clojure, html, data-mining, near-duplicate-detection, structural-similarity

I spent the last few months studying and implementing some routines that take a raw HTML document (or documents) and do stuff with it (them). Subotai is a library that consolidates some of these routines. In this blog post I will describe what is currently implemented and what the roadmap is.