When building crawlers, most of the effort is expended in guiding them through a website. For example, if we want to crawl all pages and individual posts on this blog, we extract links like so (a hand-written sketch of this loop follows the list):
- Visit current webpage
- Extract pagination links
- Extract link to each blog post
- Enqueue extracted links
- Continue
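Written by hand, that loop looks roughly like the sketch below. Note that `fetch` and `extract-links` are hypothetical helpers standing in for an HTTP client and a link extractor; they are not part of pegasus.

```clojure
;; A rough sketch of the hand-written version of the loop above,
;; to show the boilerplate the DSL removes.
(defn crawl-by-hand
  [seed fetch extract-links]
  (loop [queue (conj clojure.lang.PersistentQueue/EMPTY seed)
         seen  #{seed}]
    (when-let [url (peek queue)]
      (let [body  (fetch url)                          ; visit the current page
            links (remove seen (extract-links body))]  ; pagination + post links
        (recur (into (pop queue) links)                ; enqueue what we found
               (into seen links))))))                  ; and continue
```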
In this blog post, I present a new DSL that allows you to concisely describe this process.
This DSL is now part of this crawler: https://github.com/shriphani/pegasus
Enlive provides an idiomatic way to select elements, and it forms the foundation of most of the work here. Let us implement the crawler discussed above.
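As a quick refresher, here is the kind of selection Enlive gives us, assuming Enlive is on the classpath; the selector mirrors the one the DSL uses below:

```clojure
(require '[net.cgrand.enlive-html :as html])

;; Parse the blog's front page and grab the <a> tag inside each post header.
(def post-link-nodes
  (html/select (html/html-resource (java.net.URL. "http://blog.shriphani.com"))
               [:article :header :h2 :a]))
```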
Since the focus (in this blog post at least) is on extracting links, let us look at the DSL:
```clojure
(defextractors
  (extract :at-selector [:article :header :h2 :a]
           :follow :href
           :with-regex #"blog.shriphani.com")

  (extract :at-selector [:ul.pagination :a]
           :follow :href
           :with-regex #"blog.shriphani.com"))
```
And that is it! We specified an Enlive selector to pull tags, the attribute whose value to follow, and a regex to filter the extracted URLs.
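Conceptually, each extract clause behaves like the sketch below: select nodes, read one attribute off each, and keep the values that match the regex. This is only my reading of the semantics, not pegasus's actual implementation:

```clojure
(defn extract-links-sketch
  [nodes follow with-regex]
  (->> nodes                                       ; nodes matched by :at-selector
       (map #(get-in % [:attrs follow]))           ; follow the given attribute
       (filter #(and % (re-find with-regex %)))))  ; keep URLs matching the regex

;; e.g. (extract-links-sketch post-link-nodes :href #"blog.shriphani.com")
```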
With pegasus, the full crawler is expressed as:
```clojure
(defn crawl-sp-blog-custom-extractor
  []
  (crawl {:seeds ["http://blog.shriphani.com"]
          :user-agent "Pegasus web crawler"
          :extractor (defextractors
                       (extract :at-selector [:article :header :h2 :a]
                                :follow :href
                                :with-regex #"blog.shriphani.com")

                       (extract :at-selector [:ul.pagination :a]
                                :follow :href
                                :with-regex #"blog.shriphani.com"))
          :corpus-size 20 ;; crawl 20 documents
          :job-dir "/tmp/sp-blog-corpus"}))
```
Essentially, under 20 lines.
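Running it is just a function call from the REPL (assuming crawl and the DSL macros are referred in from pegasus); the crawled documents end up under the :job-dir given above:

```clojure
;; Kick off the crawl; output is written to "/tmp/sp-blog-corpus".
(crawl-sp-blog-custom-extractor)
```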
The DSL was partially inspired by the great work done in this crawler for Node.js: https://github.com/rchipka/node-osmosis
Links: