Leveraging a scalable web-crawler in clojure

Thu, 12 Mar 2015 09:29:39 UT

Update: I have been working on a nicer, fuller crawler in Clojure: Pegasus.

Nutch and Heritrix are battle-tested web-crawlers. The ClueWeb09, ClueWeb12 and Common Crawl corpora were each collected with one of them.

Toy crawlers that hold important data-structures in memory fail spectacularly when downloading a large number of pages. Heritrix and Nutch benefit from several man-years of work aimed at stability and scalability.

In a previous project, I wanted to leverage Heritrix’s infrastructure while keeping the flexibility to implement some custom components in Clojure: for instance, extracting only certain links based on the output of a classifier, or using simple enlive selectors.

The solution I used was to expose the routines I wanted via a web-server and have Heritrix request these routines.

This let me use libraries I am comfortable with, like enlive, and still benefit from the infrastructure Heritrix provides.

What follows is a library, sleipnir, that lets you do all this in a simple way.
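
To make the idea concrete, here is a minimal sketch of exposing a routine over HTTP. It is not sleipnir's code: it assumes the JDK's built-in com.sun.net.httpserver server, and the names extract-links, routine-handler and start-server, the /extract path, and the regex-based extractor are all hypothetical stand-ins. The crawler side would send a page body to the endpoint and read back one link per line.

```clojure
(ns sketch.routine-server
  (:require [clojure.string :as str])
  (:import [com.sun.net.httpserver HttpHandler HttpServer]
           [java.net InetSocketAddress]))

(defn extract-links
  "Hypothetical stand-in extractor: every double-quoted href value in body."
  [body]
  (mapv second (re-seq #"href=\"([^\"]+)\"" body)))

(defn routine-handler
  "Wraps a body->links function as an HTTP handler. The request body is the
  page; the response is one extracted link per line."
  [extractor]
  (reify HttpHandler
    (handle [_ exchange]
      (let [page  (slurp (.getRequestBody exchange) :encoding "UTF-8")
            reply (.getBytes (str/join "\n" (extractor page)) "UTF-8")]
        (.sendResponseHeaders exchange 200 (alength reply))
        (with-open [out (.getResponseBody exchange)]
          (.write out reply))))))

(defn start-server
  "Starts an HTTP server on port that serves extractor at /extract."
  ^HttpServer [port extractor]
  (doto (HttpServer/create (InetSocketAddress. (int port)) 0)
    (.createContext "/extract" (routine-handler extractor))
    (.start)))
```

Evaluating (start-server 8080 extract-links) would serve the routine on port 8080 (an arbitrary choice), and calling .stop on the returned server shuts it down. The appeal of this split is that the routine stays an ordinary Clojure function that can be tested without a crawl running.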

Intuition

You need to specify two routines: (i) an extractor that takes a web-page and extracts links from it, and (ii) a writer that takes a body and other information and writes in whatever format you want.

Getting Started

First, download and spin up a Heritrix instance (REQUIRED for a crawl to complete).

wget https://s3-us-west-2.amazonaws.com/sleipnir-heritrix/heritrix-3.3.0-SNAPSHOT-dist.zip
unzip heritrix-3.3.0-SNAPSHOT-dist.zip
cd heritrix-3.3.0-SNAPSHOT-dist
./bin/heritrix -a admin:admin

Heritrix is now running at https://localhost:8443 and can be accessed with the username and password admin/admin.

Next, let us set up a simple crawl using Clojure routines.

We start with the imports:

(ns sleipnir.demo
  "Dude this is the demo"
  (:require [net.cgrand.enlive-html :as html]
            [sleipnir.handler :as handler]
            [org.bovinegenius.exploding-fish :as uri])
  (:import [java.io StringReader]))

Say, I want to walk through reddit’s pagination. We use enlive selectors for our extractor code:

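(defn reddit-pagination-extractor
  "Pulls reddit pagination using enlive"
  [url body]
  (let [resource (-> body (StringReader.) html/html-resource)
        ;; the next/prev pagination anchors
        anchors  (html/select resource [:span.nextprev :a])]
    ;; drop anchors whose href could not be resolved
    (filter
     identity
     (map
      (fn [an-anchor]
        (println an-anchor) ; debug output
        ;; resolve the href against the page URL; nil on failure
        (try (uri/resolve-uri url
                              (-> an-anchor
                                  :attrs
                                  :href))
             (catch Exception _ nil)))
      anchors))))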
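
The extractor resolves each href against the page URL because pagination links are often relative. As a rough illustration of that step, here is a sketch that uses the JDK's java.net.URI in place of exploding-fish; safe-resolve and resolve-all are hypothetical names, and the URLs are made up for the example.

```clojure
(ns sketch.resolve-links
  (:import [java.net URI]))

(defn safe-resolve
  "Resolves href against page-url. Returns nil when href is missing or
  malformed instead of throwing."
  [^String page-url ^String href]
  (try
    (str (.resolve (URI. page-url) href))
    (catch Exception _ nil)))

(defn resolve-all
  "Resolves every href and keeps only the ones that resolved."
  [page-url hrefs]
  (keep #(safe-resolve page-url %) hrefs))

(resolve-all "https://www.reddit.com/r/programming/"
             ["/r/programming/?count=25&after=t3_abc" ; relative to the host
              "http://imgur.com/HqG6Udq"              ; already absolute
              nil                                     ; anchor without an href
              "ht tp://broken"])                      ; malformed
```

Only the first two entries survive: the relative link is resolved against the reddit host, the absolute link is returned unchanged, and the missing and malformed ones are dropped.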

Then, we want to store the submitted links in some location:

(defn reddit-submission-links-writer
  "Gets links to reddit submissions"
  [url body wrtr]
  (let [resource (-> body (StringReader.) html/html-resource)
        ;; the title anchor of every submission on the page
        submissions (html/select resource
                                 [:p.title :a.title])
        ;; resolved submission links; unresolvable hrefs are dropped
        links (filter
               identity
               (map
                (fn [an-anchor]
                  (try (uri/resolve-uri url
                                        (-> an-anchor
                                            :attrs
                                            :href))
                       (catch Exception _ nil)))
                submissions))]
    ;; write one link per line to the supplied writer
    (doseq [link links]
      (binding [*out* wrtr]
        (println link)))))
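
Since a writer is an ordinary function of [url body wrtr] that prints to wrtr, it can be checked without running a crawl. The sketch below assumes wrtr is a java.io.Writer (which is what binding it to *out* requires); url-and-size-writer and captured-lines are hypothetical names used only for illustration.

```clojure
(ns sketch.writer-check
  (:require [clojure.string :as str])
  (:import [java.io StringWriter]))

(defn url-and-size-writer
  "Hypothetical writer with the same [url body wrtr] shape: records the page
  URL and the length of its body on one line."
  [url body wrtr]
  (binding [*out* wrtr]
    (println url (count body))))

(defn captured-lines
  "Runs a writer routine against an in-memory StringWriter and returns the
  lines it wrote."
  [writer-fn url body]
  (let [sink (StringWriter.)]
    (writer-fn url body sink)
    (str/split-lines (str sink))))

(captured-lines url-and-size-writer "http://example.com/" "<html></html>")
;; => ["http://example.com/ 13"]
```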

And then set up and execute the crawl. The config object has a ton of options (I’ll flesh the documentation out soon). Several of these options tweak Heritrix’s settings.

(handler/crawl {:heritrix-addr "https://localhost:8443/engine"
                :job-dir       "/Users/shriphani/Documents/reddit-job"
                :username      "admin"
                :password      "admin"
                :seeds-file    "/Users/shriphani/Documents/reddit-job/seeds.txt"
                :contact-url   "http://shriphani.com/"
                :out-file      "/tmp/bodies.clj"
                :extractor     reddit-pagination-extractor
                :writer        reddit-submission-links-writer})

In the config above, we specify where Heritrix is running, the job directory, the file the writer's output goes to, and the extractor and writer routines.

The result is a Heritrix job that walks through the pagination and dumps the submitted links to /tmp/bodies.clj.

Here’s a screengrab of the job:

And here’s a snapshot of the recorded submission links:

https://www.reddit.com/r/tifu/comments/2ys7wr/tifu_by_attempting_to_clean_the_kitchen/
http://uatoday.tv/politics/ukraine-calls-for-russian-documentary-on-crimea-to-be-sent-to-hague-tribunal-414713.html
https://www.reddit.com/r/Music/comments/2ys88r/check_out_our_free_ep_feathers_by_divide_of/
https://s-media-cache-ak0.pinimg.com/736x/19/f6/18/19f618637135ec676d0dfdcd4d23b542.jpg
http://imgur.com/HqG6Udq
https://www.reddit.com/r/Jokes/comments/2ys8a9/did_you_hear_about_the_nympho_waitress/
https://www.reddit.com/r/AskReddit/comments/2ys8bb/hey_reddit_what_is_a_great_classic_or_family/
http://imgur.com/qTKH4pA
...
...

Library: https://github.com/shriphani/sleipnir

Modifying The Heritrix Web Crawler

Thu, 13 Mar 2014 05:59:07 UT

This is a post I wrote to teach myself about Heritrix and modifying it. There are solid motivations for modifying web-crawlers (say we know how to beat a simple BFS for some specific website). In this post, I will modify a routine that is central to web-crawling - extracting URLs from a webpage.

First, I am going to put together a simple extractor in Heritrix. This extractor uses an XPath (a very trivial one, for the sake of this example). I use the HtmlCleaner library to parse the supplied HTML and then the XPath classes that ship with Java. (I have personally found that most HTML parsing libraries bundle partial XPath implementations, and I typically use more complex queries for my research, so I prefer dealing with org.w3c.dom documents.)
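
As a rough illustration of the XPath step only, here is a sketch in Clojure (to match the other code on this page) that runs the JDK's javax.xml.xpath classes over an org.w3c.dom document. It is not the Heritrix extractor class: it assumes the markup is already well-formed, so the JDK's own parser stands in for HtmlCleaner, and parse-markup and xpath-values are hypothetical names.

```clojure
(ns sketch.xpath-links
  (:import [java.io ByteArrayInputStream]
           [javax.xml.parsers DocumentBuilderFactory]
           [javax.xml.xpath XPathConstants XPathFactory]
           [org.w3c.dom NodeList]))

(defn parse-markup
  "Parses well-formed markup into an org.w3c.dom.Document.
  Real-world HTML would need a cleaning step first."
  [^String markup]
  (-> (DocumentBuilderFactory/newInstance)
      (.newDocumentBuilder)
      (.parse (ByteArrayInputStream. (.getBytes markup "UTF-8")))))

(defn xpath-values
  "Evaluates expr against doc and returns the node values, in document order."
  [doc ^String expr]
  (let [xpath (.newXPath (XPathFactory/newInstance))
        ^NodeList nodes (.evaluate xpath expr doc XPathConstants/NODESET)]
    (vec (for [i (range (.getLength nodes))]
           (.getNodeValue (.item nodes i))))))

(xpath-values
 (parse-markup "<html><body><a href=\"/a\">A</a><p><a href=\"/b\">B</a></p></body></html>")
 "//a/@href")
;; => ["/a" "/b"]
```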

This is what the extractor class looks like. It is super simple:

Now, to see it in action, you need to create a Heritrix job and specify that this is the extractor you want to use. I have a test job that crawls my blog. A Heritrix job contains a configuration file where you can specify the extractors and some other details (such as the seed links). In this file, I specified the extractor class like so:

<bean id="extractorHtml" class="org.archive.modules.extractor.XPathExtractor">

(incidentally the entire file looks like this).

I was subsequently able to process a webpage without too much fuss. In the near future, I plan to describe some of the more interesting stuff I’ve been able to do with Heritrix.

SHRIPHANI PALAKODETY: Posts tagged 'heritrix'
