Pegasus: A Modular, Durable Web Crawler For Clojure

2016-01-25

clojure, scale, scalable, crawler, web-crawler

Pegasus is a durable, multithreaded web crawler for Clojure.

I wrote Pegasus after the existing choices in the Java ecosystem left me frustrated.

The more popular crawler projects (Heritrix and Nutch) are clunky and hard to configure. I have often wanted to supply my own extractors, save payloads directly to a database, and so on. Short of digging into their large codebases, there isn’t much of an option there.

Tiny crawlers hold all their data structures in memory and are incapable of crawling the entire web. A single crash somewhere loses all the state built up over a long-running crawl. I also want to be able to modify critical data structures and functions mid-crawl at times.

Pegasus gives you the following:

Parallelism using the excellent core.async library.
Disk-backed data structures that allow crawls to survive crashes, system restarts, etc. (I am still implementing the restart bits).
The bare minimum politeness needed in a crawler (support for robots.txt and rel='nofollow').
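
To make the politeness point concrete, here is a minimal sketch of the two checks in plain Clojure. This is not Pegasus's code: the function names (`disallowed-prefixes`, `allowed?`, `followable?`) and the `{:href ... :rel ...}` link shape are made up for illustration. The robots.txt handling is deliberately simplified; it only reads `Disallow` prefixes for the `*` user agent.

```clojure
(require '[clojure.string :as str])

(defn disallowed-prefixes
  "Returns the Disallow path prefixes listed under `User-agent: *`.
  Simplified on purpose: ignores Allow rules, wildcards, agent-specific
  groups, and groups that stack several User-agent lines."
  [robots-txt]
  (loop [lines    (str/split-lines robots-txt)
         applies? false
         prefixes []]
    (if-let [raw (first lines)]
      (let [line  (str/trim (first (str/split raw #"#" 2))) ; drop trailing comments
            [k v] (map str/trim (str/split line #":" 2))
            k     (str/lower-case (or k ""))]
        (cond
          ;; a new User-agent line switches the current group on or off
          (= k "user-agent")
          (recur (rest lines) (= v "*") prefixes)

          ;; an empty Disallow value means nothing is disallowed, so skip it
          (and applies? (= k "disallow") (seq v))
          (recur (rest lines) applies? (conj prefixes v))

          :else
          (recur (rest lines) applies? prefixes)))
      prefixes)))

(defn allowed?
  "True when `path` does not start with any disallowed prefix."
  [prefixes path]
  (not-any? #(str/starts-with? path %) prefixes))

(defn followable?
  "True unless the link's rel attribute contains the nofollow token.
  `link` is a hypothetical map such as {:href \"/a\" :rel \"nofollow\"}."
  [{:keys [rel]}]
  (let [tokens (set (str/split (str/lower-case (or rel "")) #"\s+"))]
    (not (contains? tokens "nofollow"))))

;; Usage: keep only the links a polite crawler may enqueue.
(def robots "User-agent: *\nDisallow: /private\nDisallow: /tmp/ # scratch\n")

(def links [{:href "/index.html"}
            {:href "/private/notes.html"}
            {:href "/ads.html" :rel "sponsored nofollow"}])

(def to-crawl
  (let [prefixes (disallowed-prefixes robots)]
    (filterv #(and (followable? %) (allowed? prefixes (:href %))) links)))
```

A real crawler would also need to fetch and cache robots.txt once per host and honour rules aimed at its own user agent, both of which this sketch leaves out.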