A Clojure DSL for Web-Crawling

When building crawlers, most of the effort is expended in guiding them through a website. For example, if we want to crawl all pages and individual posts on this blog, we extract links like so:

  1. Visit current webpage
  2. Extract pagination links
  3. Extract link to each blog post
  4. Enqueue extracted links
  5. Continue

In this blog post, I present a new DSL that allows you to concisely describe this process.

This DSL is now part of this crawler: https://github.com/shriphani/pegasus

fort-knox: A disk-backed core.cache implementation

core.cache is a small and convenient cache library for clojure. It enables clojure users to quickly roll out caches. In this blog post I am going to describe a clojure implementation which stores cache entries to disk: fort-knox.

In a few recent projects I’ve needed a cache with entries backed to disk. This is a vital requirement in applications that need to be fault-tolerant. LMDB (which I’ve had very positive experiences with) is fast, quick and perfect for this task. fort-knox implements the core.cache spec and stores entries in LMDB. clj-lmdb (subject of a previous blog post) is part of the plan now.

Note that this library deviates slightly from suggestions for core.cache implementations. For instance, the backing store doesn’t implement IPersistentCollection or Associative so fort-knox might deviate from expected behavior. Thus YMMV.

Pegasus: A Modular, Durable Web Crawler For Clojure

Pegasus is a durable, multithreaded web-crawler for clojure.

I wrote Pegasus after the existing choices in the Java ecosystem left me frustrated.

The more popular crawler projects (Heritrix and Nutch) are clunky and not easy to configure. I have often wanted to be able to supply my own extractors, save payloads directly to a database and so on. Short of digging into large codebases, there isn’t much of an option there.

Tiny crawlers hold all their data structures in memory and are incapable of crawling the entire web. A simple crash somewhere causes you to lose all state built over a long-running crawl. I also want to be able to (at times) modify critical data-structures and functions mid-crawl.

Pegasus gives you the following:

  1. Parallelism using the excellent core.async library.
  2. Disk-backed data structures that allow crawls to survive crashes, system restarts etc. (I am still implementing the restart bits).
  3. Implements the bare minimum politeness needed in crawlers (support for robots.txt and rel='nofollow').

A Beautiful Gift With Clojure and LaTeX

For my parents’ 26th anniversary, I decided to convert an online religious text they read into a beautiful, well-typeset book.

The online text was built by volunteers using an archaic version of Microsoft Word and looks like this:

Anyone who has read science or math literature is exposed to the high-quality output LaTeX produces.

Fortunately LaTeX’s abilities extend far beyond the domain of mathematical symbols.

I was able to combine Clojure’s excellent HTML processing infrastructure (enlive) and LaTeX to produce a nice looking document.

The entire process took a few hours.

Here are two pages from the final output:

This blog post contains latex and clojure snippets to produce that output. I am not good at designing books or combining typefaces and would appreciate advice.

Leveraging a scalable web-crawler in clojure

Update: I have been working on a nicer fuller crawler in clojure - Pegasus

Nutch and Heritrix are battle-tested web-crawlers. ClueWeb9, ClueWeb12 and the Common-Crawl corpora employed one of these.

Toy crawlers that hold important data-structures in memory fail spectacularly when downloading a large number of pages. Heritrix and Nutch benefit from several man-years of work aimed at stability and scalability.

In a previous project, I wanted to leverage Heritrix’s infrastructure and the flexibility to implement some custom components in Clojure. For instance, being able to extract certain links based on the output of a classifier. Or being able to use simple enlive selectors.

The solution I used was to expose the routines I wanted via a web-server and have Heritrix request these routines.

This allowed me to use libraries like enlive that I am comfortable with and still avail the benefits of the infra Heritrix provides.

What follows is a library - sleipnir, that allows you to do all this in a simple way.

Implementing Truncated Matrix Decompositions for Core.Matrix

Eigendecompositions and Singular Value Decompositions appear in a variety of settings in machine learning and data mining. The eigendecomposition looks like so:

$$ \mathbf{A}=\mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{-1} $$

$ \mathbf{Q} $ contains the eigenvectors of $ \mathbf{A} $ and $ \mathbf{\Lambda} $ is a diagonal matrix containing the eigenvalues.

The singular value decomposition looks like:

$$ \mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^* $$

$ \mathbf{U} $ contains the eigenvectors of the covariance matrix $ \mathbf{A}\mathbf{A^T} $. $ \mathbf{V} $ contains the eigenvectors of the gram matrix $ \mathbf{A^T}\mathbf{A} $.

The truncated variants of these decompositions allow us to compute only a few eigenvalues(vectors) or singular values (vectors).

This is important since (i) a lot of times, the smaller eigenvalues are discarded, and (ii) you don’t want to compute the entire decomposition and retain only a few of the rows and columns of the computed matrices each time.

For core.matrix, I implemented these truncated decompositions in Kublai. Details below.

The Isomap Algorithm

This is part of a series on a family of dimension-reduction algorithms called non-linear dimension reduction. The goal here is to reduce the dimensions of a dataset (i.e. discard some columns in your data)

In previous posts, I discussed the MDS algorithm and presented some key ideas. In this post, I will describe how those ideas are leveraged in the Isomap algorithm. A clojure implementation based on core.matrix is also included.

Low Dimension Embeddings for Visualization

Representation learning is a hot area in machine learning. In natural language processing (for example), learning long vectors for words has proven quite effective on several tasks. Often, these representations have several hundred dimensions. To perform a qualitative analysis of the learned representations, it helps to visualize them. Thus, we need a principled approach to drop from this high dimensional space to a lower dimensional space (like $ \mathbb{R}^2 $ for instance).

In this blog post, I will discuss the Multidimensional scaling (MDS) algorithm - a manifold learning algorithm that recovers (or at least tries to recover) the underlying manifold that contains the data.

MDS aims to find a configuration in a lower-dimension space that preserves distances between points in the data. In this post, I will provide intuition for this algorithm and an implementation for clojure (incanter).

Per Intellectum, Vis
(c) Shriphani Palakodety 2013-2016