When building crawlers, most of the effort is expended in guiding them through a website. For example, if we want to crawl all pages and individual posts on this blog, we extract links like so:

- Visit current webpage
- Extract pagination links
- Extract link to each blog post
- Enqueue extracted links
- Continue
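The loop above can be sketched in a few lines of plain clojure. This is only an illustration of the visit/extract/enqueue cycle, not the DSL itself or code from Pegasus: the `site` map stands in for real HTTP fetches, and the names `site`, `extract-links` and `crawl` are hypothetical.

```clojure
;; Hypothetical in-memory "website": URL -> links found on that page.
(def site
  {"/"       {:pagination ["/page/2"] :posts ["/post/a" "/post/b"]}
   "/page/2" {:pagination ["/"]       :posts ["/post/c"]}
   "/post/a" {}
   "/post/b" {}
   "/post/c" {}})

(defn extract-links
  "Pagination links first, then links to individual posts."
  [page]
  (concat (:pagination page) (:posts page)))

(defn crawl
  "Breadth-first crawl starting at seed. fetch maps a URL to a page.
   Returns the URLs in the order they were visited."
  [fetch seed]
  (loop [queue   (conj clojure.lang.PersistentQueue/EMPTY seed)
         seen    #{seed}
         visited []]
    (if-let [url (peek queue)]
      ;; Visit the page, extract its links, enqueue the unseen ones, continue.
      (let [links (remove seen (distinct (extract-links (fetch url))))]
        (recur (into (pop queue) links)
               (into seen links)
               (conj visited url)))
      visited)))

(crawl site "/")
;; => ["/" "/page/2" "/post/a" "/post/b" "/post/c"]
```

The `seen` set is what keeps the crawler from looping between `/` and `/page/2`, which link to each other.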

In this blog post, I present a new DSL that allows you to concisely describe this process.

This DSL is now part of this crawler: https://github.com/shriphani/pegasus

`core.cache` is a small and convenient cache library for clojure. It enables clojure users to quickly roll out caches. In this blog post I am going to describe a clojure implementation which stores cache entries to disk: fort-knox.

In a few recent projects I’ve needed a cache with entries backed to disk. This is a vital requirement in applications that need to be fault-tolerant. LMDB (which I’ve had very positive experiences with) is fast and well suited to this task. `fort-knox` implements the `core.cache` spec and stores entries in LMDB. `clj-lmdb` (subject of a previous blog post) is part of the plan now.
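For readers who have not used the `core.cache` functions that such an implementation has to support, here is a minimal sketch. It uses the in-memory `basic-cache-factory` that ships with `core.cache`, not `fort-knox` itself, and the helper name `get-or-compute` is hypothetical.

```clojure
(require '[clojure.core.cache :as cache])

;; Hypothetical helper: return a cache that is guaranteed to contain k.
(defn get-or-compute
  [c k compute]
  (if (cache/has? c k)
    (cache/hit c k)                 ; record the hit (matters for LRU, TTL, ...)
    (cache/miss c k (compute k))))  ; store the freshly computed value

;; A plain in-memory cache from core.cache.
(def c0 (cache/basic-cache-factory {}))
(def c1 (get-or-compute c0 :answer (fn [_] (* 6 7))))

(cache/lookup c1 :answer) ; => 42
(cache/has? c0 :answer)   ; => false, this in-memory cache is an immutable value
```

The `has?`/`hit`/`miss` pattern is the usual way to work with these caches: each call returns a cache value, which the caller holds on to.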

Note that this library deviates slightly from suggestions for `core.cache` implementations. For instance, the backing store doesn’t implement `IPersistentCollection` or `Associative`, so `fort-knox` might deviate from expected behavior. Thus YMMV.

LMDB is the nicest, no-nonsense, no-surprise key-value store I’ve ever used.

In several benchmarks, LMDB destroys competitors - it is a beloved tool in high-profile circles.

Here are clojure bindings: https://github.com/shriphani/clj-lmdb

Pegasus is a durable, multithreaded web-crawler for clojure.

I wrote Pegasus after the existing choices in the Java ecosystem left me frustrated.

The more popular crawler projects (Heritrix and Nutch) are clunky and not easy to configure. I have often wanted to be able to supply my own extractors, save payloads directly to a database and so on. Short of digging into large codebases, there isn’t much of an option there.

Tiny crawlers hold all their data structures in memory and are incapable of crawling the entire web. A simple crash somewhere causes you to lose all state built over a long-running crawl. I also want to be able to (at times) modify critical data-structures and functions mid-crawl.

Pegasus gives you the following:

- Parallelism using the excellent `core.async` library.
- Disk-backed data structures that allow crawls to survive crashes, system restarts etc. (I am still implementing the restart bits).
- The bare minimum politeness needed in crawlers (support for `robots.txt` and `rel='nofollow'`).

For my parents’ 26th anniversary, I decided to convert an online religious text they read into a beautiful, well-typeset book.

The online text was built by volunteers using an archaic version of Microsoft Word and looks like this:

Anyone who has read science or math literature is exposed to the high-quality output LaTeX produces.

Fortunately LaTeX’s abilities extend far beyond the domain of mathematical symbols.

I was able to combine Clojure’s excellent HTML processing infrastructure (enlive) and LaTeX to produce a nice looking document.

The entire process took a few hours.

Here are two pages from the final output:

This blog post contains latex and clojure snippets to produce that output. I am not good at designing books or combining typefaces and would appreciate advice.

**Update**: I have been working on a nicer, fuller crawler in clojure: Pegasus

Nutch and Heritrix are battle-tested web-crawlers. ClueWeb09, ClueWeb12 and the Common-Crawl corpora employed one of these.

Toy crawlers that hold important data-structures in memory fail spectacularly when downloading a large number of pages. Heritrix and Nutch benefit from several man-years of work aimed at stability and scalability.

In a previous project, I wanted to leverage Heritrix’s infrastructure while keeping the flexibility to implement some custom components in Clojure: for instance, extracting certain links based on the output of a classifier, or using simple `enlive` selectors.

The solution I used was to expose the routines I wanted via a web-server and have Heritrix request these routines.

This allowed me to use libraries like `enlive` that I am comfortable with and still benefit from the infrastructure Heritrix provides.

What follows is a library, sleipnir, that allows you to do all this in a simple way.

In the past few blog posts, I covered some details of popular dimension-reduction techniques and showed some common themes. In this post, I will collect all these materials and tie them together.

Eigendecompositions and Singular Value Decompositions appear in a variety of settings in machine learning and data mining. The eigendecomposition looks like so:

$$ \mathbf{A}=\mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{-1} $$

$ \mathbf{Q} $ contains the eigenvectors of $ \mathbf{A} $ and $ \mathbf{\Lambda} $ is a diagonal matrix containing the eigenvalues.
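As a small worked example (the matrix below is chosen purely for illustration and does not come from the earlier posts), take a symmetric $ 2 \times 2 $ matrix with eigenvalues $ 3 $ and $ 1 $:

$$ \mathbf{A} = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad \mathbf{Q} = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad \mathbf{\Lambda} = \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix} $$

Here $ \mathbf{Q} $ is orthogonal and symmetric, so $ \mathbf{Q}^{-1} = \mathbf{Q} $, and multiplying out recovers the original matrix:

$$ \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{-1} = \frac{1}{2} \begin{pmatrix} 3 & 1 \\ 3 & -1 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} = \mathbf{A} $$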

The singular value decomposition looks like:

$$ \mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^* $$

$ \mathbf{U} $ contains the eigenvectors of the covariance matrix $ \mathbf{A}\mathbf{A}^T $. $ \mathbf{V} $ contains the eigenvectors of the Gram matrix $ \mathbf{A}^T\mathbf{A} $.

The **truncated** variants of these decompositions allow us to compute only a few eigenvalues (and eigenvectors) or singular values (and singular vectors).

This is important since (i) the smaller eigenvalues are often discarded, and (ii) you don’t want to compute the entire decomposition each time only to retain a few of the rows and columns of the computed matrices.
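In symbols, keeping only the $ k $ largest singular values gives the usual rank-$ k $ form of the SVD. The subscripted names $ \mathbf{A}_k $, $ \mathbf{U}_k $, $ \boldsymbol{\Sigma}_k $, $ \mathbf{V}_k $ are notation assumed here for illustration: $ \mathbf{U}_k $ and $ \mathbf{V}_k $ hold the first $ k $ columns $ \mathbf{u}_i $ and $ \mathbf{v}_i $ of $ \mathbf{U} $ and $ \mathbf{V} $, and $ \boldsymbol{\Sigma}_k $ holds the $ k $ largest singular values $ \sigma_1 \ge \dots \ge \sigma_k $:

$$ \mathbf{A}_k = \mathbf{U}_k \boldsymbol{\Sigma}_k \mathbf{V}_k^* = \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^* $$

A truncated routine computes these $ k $ triplets directly instead of computing the full decomposition and slicing it afterwards.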

For core.matrix, I implemented these truncated decompositions in Kublai. Details below.

*This is part of a series on a family of dimension-reduction algorithms called non-linear dimension reduction. The goal here is to reduce the dimensions of a dataset (i.e. discard some columns in your data)*

In previous posts, I discussed the MDS algorithm and presented some key ideas. In this post, I will describe how those ideas are leveraged in the Isomap algorithm. A clojure implementation based on core.matrix is also included.

Representation learning is a hot area in machine learning. In natural language processing (for example), learning long vectors for words has proven quite effective on several tasks. Often, these representations have several hundred dimensions. To perform a qualitative analysis of the learned representations, it helps to visualize them. Thus, we need a principled approach to drop from this high dimensional space to a lower dimensional space (like $ \mathbb{R}^2 $ for instance).

In this blog post, I will discuss the Multidimensional scaling (MDS) algorithm - a manifold learning algorithm that recovers (or at least tries to recover) the underlying manifold that contains the data.

MDS aims to find a configuration in a lower-dimension space that preserves distances between points in the data. In this post, I will provide intuition for this algorithm and an implementation for clojure (incanter).
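One common way to write this goal down is the metric MDS objective. The notation is assumed here for illustration: $ d_{ij} $ is the distance between points $ i $ and $ j $ in the original data, and $ \mathbf{y}_1, \ldots, \mathbf{y}_n $ are the low-dimensional coordinates being sought in $ \mathbb{R}^k $:

$$ \min_{\mathbf{y}_1, \ldots, \mathbf{y}_n \in \mathbb{R}^k} \sum_{i < j} \left( \lVert \mathbf{y}_i - \mathbf{y}_j \rVert - d_{ij} \right)^2 $$

The sum is zero exactly when every pairwise distance is preserved in the lower-dimensional configuration.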