Pegasus: A Modular, Durable Web Crawler For Clojure
Pegasus is a durable, multithreaded web crawler for Clojure.
I wrote Pegasus after the existing choices in the Java ecosystem left me frustrated.
The more popular crawler projects (Heritrix and Nutch) are clunky and hard to configure. I have often wanted to supply my own extractors, save payloads directly to a database, and so on. Short of digging into those large codebases, there isn't much of an option.
Tiny crawlers, at the other extreme, hold all their data structures in memory and cannot scale to a crawl of the entire web. A single crash loses all the state built up over a long-running crawl. I also want to be able, at times, to modify critical data structures and functions mid-crawl.
Pegasus gives you the following:
- Parallelism using the excellent `core.async` library (a sketch of the pattern follows this list).
- Disk-backed data structures that allow crawls to survive crashes, system restarts, and so on (I am still implementing the restart bits; see the journaling sketch below).
- The bare minimum politeness needed in crawlers: support for `robots.txt` and `rel='nofollow'` (a toy `robots.txt` check appears below).
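Here is a minimal sketch of the core.async pattern involved. This is not Pegasus's actual internals; the `fetch` function, channel sizes, and worker count are all placeholder choices:

```clojure
(ns crawler.parallel-sketch
  (:require [clojure.core.async :as async]))

(defn fetch
  "Placeholder fetcher: download a URL and return it with its body."
  [url]
  {:url url :body (slurp url)})

(def urls (async/chan 100))      ; the frontier: URLs waiting to be fetched
(def payloads (async/chan 100))  ; fetched pages, ready for extraction

;; pipeline-blocking runs `fetch` on 4 dedicated threads, the right
;; pipeline variant for blocking network IO.
(async/pipeline-blocking 4 payloads (map fetch) urls)

(comment
  (async/>!! urls "https://example.com/")
  (async/<!! payloads))
```

Backpressure comes for free: when `payloads` fills up the workers block, and when `urls` drains they simply wait for more work.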
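On the durability front, one simple scheme, sketched here under my own assumptions rather than as a description of Pegasus's mechanism, is to journal crawl state to an append-only log and replay it on startup; the `visited.log` path is hypothetical:

```clojure
(ns crawler.durability-sketch
  (:require [clojure.java.io :as io]))

(def journal-file "visited.log") ; hypothetical location for the journal

(defn record-visit!
  "Append a visited URL to the on-disk journal."
  [url]
  (spit journal-file (str url "\n") :append true))

(defn load-visited
  "Rebuild the in-memory visited set by replaying the journal."
  []
  (if (.exists (io/file journal-file))
    (with-open [rdr (io/reader journal-file)]
      (into #{} (line-seq rdr)))
    #{}))
```

Because the log is append-only, a crash loses at most the write in flight; a restart reloads the visited set and the crawl picks up where it left off.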
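Finally, a toy version of the `robots.txt` half of the politeness story. This assumes the simplest possible file, with only global `Disallow` rules; a real reader must also honor per-agent sections, `Allow` rules, and crawl delays:

```clojure
(ns crawler.politeness-sketch
  (:require [clojure.string :as str]))

(defn disallowed-paths
  "Collect the Disallow path prefixes from a robots.txt body."
  [robots-txt]
  (->> (str/split-lines robots-txt)
       (map str/trim)
       (keep #(second (re-matches #"(?i)Disallow:\s*(\S+)" %)))))

(defn allowed?
  "True when `path` does not fall under any disallowed prefix."
  [robots-txt path]
  (not-any? #(str/starts-with? path %) (disallowed-paths robots-txt)))

(comment
  (allowed? "User-agent: *\nDisallow: /search\n" "/search/foo") ; => false
  (allowed? "User-agent: *\nDisallow: /search\n" "/about"))     ; => true
```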