When building crawlers, most of the effort goes into guiding them through a website. For example, if we want to crawl all pages and individual posts on this blog, we extract links like so (a code sketch of these steps follows the list):
- Visit current webpage
- Extract pagination links
- Extract link to each blog post
- Enqueue extracted links
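To make the steps concrete, here is a minimal sketch of this loop using enlive (a library mentioned later in this post). The selectors `:.pagination` and `:h2.post-title` are assumptions about the blog's markup, not a description of its actual layout:

```clojure
(ns crawl-sketch
  (:require [net.cgrand.enlive-html :as html]))

;; NOTE: the selectors below are hypothetical; adjust them to the
;; actual markup of the site you are crawling.
(defn extract-links
  "Fetch url and return pagination links and per-post links."
  [url]
  (let [page  (html/html-resource (java.net.URL. url))
        hrefs (fn [selector]
                (keep #(get-in % [:attrs :href])
                      (html/select page selector)))]
    {:pagination (hrefs [:.pagination :a])
     :posts      (hrefs [:h2.post-title :a])}))

;; Enqueue everything we found (here the frontier is just a vector).
(defn enqueue-links
  [frontier url]
  (let [{:keys [pagination posts]} (extract-links url)]
    (into frontier (concat pagination posts))))
```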
In this blog post, I present a new DSL that allows you to concisely describe this process.
This DSL is now part of this crawler: https://github.com/shriphani/pegasus
Update: I have since been working on a nicer, fuller crawler in Clojure: Pegasus.
Nutch and Heritrix are battle-tested web crawlers. The ClueWeb09, ClueWeb12, and Common Crawl corpora were each built with one of them.
Toy crawlers that hold important data structures in memory fail spectacularly when downloading a large number of pages. Heritrix and Nutch benefit from several man-years of work aimed at stability and scalability.
In a previous project, I wanted to leverage Heritrix's infrastructure while retaining the flexibility to implement some custom components in Clojure: for instance, extracting certain links based on the output of a classifier, or simply using familiar Clojure libraries for extraction.
The solution I used was to expose the routines I wanted via a web server and have Heritrix call them over HTTP.
This allowed me to use libraries like enlive, which I am comfortable with, and still enjoy the benefits of the infrastructure Heritrix provides.
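To illustrate the shape of this approach (a sketch only, not sleipnir's actual API), here is a minimal Ring server that accepts a page's HTML and answers with extracted links, one per line. The port, endpoint behavior, and plain-text response format are all assumptions:

```clojure
(ns extraction-server-sketch
  (:require [clojure.string :as str]
            [net.cgrand.enlive-html :as html]
            [ring.adapter.jetty :refer [run-jetty]])
  (:import [java.io StringReader]))

(defn extract-hrefs
  "Pull every anchor href out of a raw HTML string."
  [body]
  (->> (html/select (html/html-resource (StringReader. body)) [:a])
       (keep #(get-in % [:attrs :href]))))

;; The crawler POSTs the fetched page's HTML; we reply with one
;; extracted URL per line. (The endpoint shape is hypothetical.)
(defn handler [request]
  {:status  200
   :headers {"Content-Type" "text/plain"}
   :body    (str/join "\n" (extract-hrefs (slurp (:body request))))})

(defn -main [& args]
  (run-jetty handler {:port 8080}))
```

The same pattern works for any custom component: a classifier-backed link filter, say, would sit behind its own endpoint and return a keep/drop decision instead of a link list.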
What follows is a library, sleipnir, that lets you do all of this in a simple way.