Leveraging a scalable web-crawler in Clojure

Update: I have been working on a nicer, more complete crawler in Clojure - Pegasus

Nutch and Heritrix are battle-tested web-crawlers. The ClueWeb09, ClueWeb12, and Common Crawl corpora were each built using one of these crawlers.

Toy crawlers that hold important data-structures in memory fail spectacularly when downloading a large number of pages. Heritrix and Nutch benefit from several man-years of work aimed at stability and scalability.

In a previous project, I wanted to leverage Heritrix's infrastructure while retaining the flexibility to implement some custom components in Clojure - for instance, extracting certain links based on the output of a classifier, or using simple enlive selectors.

The solution I used was to expose the routines I wanted via a web server and have Heritrix call them over HTTP.

This allowed me to use libraries like enlive, which I am comfortable with, while still enjoying the benefits of the infrastructure Heritrix provides.
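
To make the idea concrete, here is a minimal sketch (not sleipnir itself) of that pattern: a small Ring server exposing an enlive-based link extractor that Heritrix can call over HTTP. The namespace, route behaviour, and response format are illustrative assumptions, not sleipnir's actual API.

(ns crawler-hooks.core
  (:require [clojure.string :as str]
            [net.cgrand.enlive-html :as html]
            [ring.adapter.jetty :refer [run-jetty]]))

;; Pull the href out of every anchor tag in an HTML string.
(defn extract-links [html-str]
  (->> (html/select (html/html-snippet html-str) [:a])
       (keep #(get-in % [:attrs :href]))))

;; Heritrix POSTs a fetched page body; we answer with one URL per line.
(defn handler [request]
  {:status  200
   :headers {"Content-Type" "text/plain"}
   :body    (str/join "\n" (extract-links (slurp (:body request))))})

(defn -main [& _]
  (run-jetty handler {:port 3000}))

On the Heritrix side, a small custom processor would POST each fetched page to this endpoint and enqueue whatever URLs come back.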

What follows is a library, sleipnir, that lets you do all of this in a simple way.


Clojure/Java String trim

Java’s String.trim only strips characters with code points up to U+0020, and Clojure’s clojure.string/trim tests for whitespace using Character/isWhitespace - neither handles every Unicode space character.

While processing a dataset off the web containing Unicode space characters, the trim routine failed to do anything useful. Luckily, a StackOverflow thread suggested using a routine from Google’s Guava library. So in Clojure, you can do this:

#(.trimFrom CharMatcher/WHITESPACE %)

and this will do the job.
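
For completeness, here is a self-contained version with the Guava import; the trim-unicode helper name is made up for illustration, and note that newer Guava releases deprecate the WHITESPACE constant in favour of the CharMatcher/whitespace method.

(import '[com.google.common.base CharMatcher])

;; Trim leading and trailing whitespace, including Unicode space
;; characters that String.trim and clojure.string/trim leave behind.
(defn trim-unicode [s]
  (.trimFrom CharMatcher/WHITESPACE s))

;; Example: (trim-unicode "\u00a0hello\u00a0") should return "hello".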


Modifying The Heritrix Web Crawler

This is a post I wrote to teach myself about Heritrix and how to modify it. There are solid motivations for modifying web-crawlers (say we know how to beat a simple BFS for a specific website). In this post, I will modify a routine that is central to web-crawling - extracting URLs from a webpage.
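
To give a flavour of what "beating a simple BFS" can mean, here is an illustrative Clojure sketch (not the post's actual modification to Heritrix): score candidate URLs with a heuristic and visit them best-first. The score function is a made-up placeholder, and the sketch assumes org.clojure/data.priority-map is on the classpath.

(require '[clojure.data.priority-map :refer [priority-map]])

;; Hypothetical heuristic: prefer shallow URLs (fewer path segments).
(defn score [url]
  (count (re-seq #"/" url)))

;; Order a batch of candidate URLs the way a best-first frontier would
;; visit them, instead of first-in-first-out as in plain BFS.
(defn best-first-order [urls]
  (->> (into (priority-map) (map (juxt identity score) urls))
       (map key)))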


