<?xml version="1.0" encoding="utf-8"?> 
<rss version="2.0">
 <channel>
  <title>SHRIPHANI PALAKODETY: Posts tagged 'heritrix'</title>
  <description>SHRIPHANI PALAKODETY: Posts tagged 'heritrix'</description>
  <link>http://blog.shriphani.com/tags/heritrix.html</link>
  <lastBuildDate>Thu, 12 Mar 2015 09:29:39 UT</lastBuildDate>
  <pubDate>Thu, 12 Mar 2015 09:29:39 UT</pubDate>
  <ttl>1800</ttl>
  <item>
   <title>Leveraging a scalable web-crawler in clojure</title>
   <link>http://blog.shriphani.com/2015/03/12/leveraging-a-scalable-web-crawler-in-clojure/?utm_source=heritrix&amp;utm_medium=RSS</link>
   <guid>urn:http-blog-shriphani-com:-2015-03-12-leveraging-a-scalable-web-crawler-in-clojure</guid>
   <pubDate>Thu, 12 Mar 2015 09:29:39 UT</pubDate>
   <description>&lt;html&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: I have been working on a nicer, fuller crawler in Clojure: &lt;a href="https://github.com/shriphani/pegasus"&gt;Pegasus&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://nutch.apache.org/"&gt;Nutch&lt;/a&gt; and &lt;a href="https://webarchive.jira.com/secure/Dashboard.jspa"&gt;Heritrix&lt;/a&gt; are battle-tested web-crawlers. &lt;a href="http://www.lemurproject.org/clueweb09.php/"&gt;ClueWeb9&lt;/a&gt;, &lt;a href="http://www.lemurproject.org/clueweb12.php/"&gt;ClueWeb12&lt;/a&gt; and the Common-Crawl corpora employed one of these.&lt;/p&gt;

&lt;p&gt;Toy crawlers that hold important data-structures in memory fail spectacularly when downloading a large number of pages. Heritrix and Nutch benefit from several man-years of work aimed at stability and scalability.&lt;/p&gt;

&lt;p&gt;In a previous project, I wanted to leverage Heritrix&amp;rsquo;s infrastructure while keeping the flexibility to implement some custom components in Clojure: for instance, extracting only certain links based on the output of a classifier, or using simple &lt;code&gt;enlive&lt;/code&gt; selectors.&lt;/p&gt;

&lt;p&gt;The solution I used was to expose the routines I wanted via a web-server and have Heritrix request these routines.&lt;/p&gt;

&lt;p&gt;This allowed me to use libraries like &lt;code&gt;enlive&lt;/code&gt; that I am comfortable with and still benefit from the infrastructure Heritrix provides.&lt;/p&gt;

&lt;p&gt;What follows introduces &lt;a href="https://github.com/shriphani/sleipnir"&gt;sleipnir&lt;/a&gt;, a library that lets you do all of this in a simple way.&lt;/p&gt;
&lt;!-- more--&gt;

&lt;h2 id="intuition"&gt;Intuition&lt;/h2&gt;

&lt;p&gt;You need to specify two routines: (i) an extractor that takes a web-page and extracts links from it, and (ii) a writer that takes a body and other information and writes in whatever format you want.&lt;/p&gt;
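
&lt;p&gt;As a rough sketch of those two contracts, here is a dependency-free version that uses only the JDK and &lt;code&gt;clojure.core&lt;/code&gt;. The argument lists mirror the routines used later in this post; the names &lt;code&gt;toy-extractor&lt;/code&gt; and &lt;code&gt;toy-writer&lt;/code&gt; and the regex-based link matching are illustrative assumptions, not part of sleipnir.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;;; Hypothetical stand-ins for the two routines (not sleipnir code).

(defn toy-extractor
  "Takes a page URL and its body; returns absolute links to crawl next."
  [url body]
  (let [base (java.net.URI. url)]
    ;; Naive href matching; a real extractor would use an HTML parser.
    (for [[_ href] (re-seq #"href=\"([^\"]+)\"" body)]
      (str (.resolve base (str href))))))

(defn toy-writer
  "Takes a page URL, its body and a java.io.Writer; records one line per page."
  [url body wrtr]
  (binding [*out* wrtr]
    (println url (count body))))&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The extractor only decides where the crawl goes next, while the writer only decides what gets stored, so either one can be swapped without touching the other.&lt;/p&gt;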

&lt;h2 id="getting-started"&gt;Getting Started&lt;/h2&gt;

&lt;p&gt;First, download and spin up a Heritrix instance (REQUIRED for a crawl to complete).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;wget https://s3-us-west-2.amazonaws.com/sleipnir-heritrix/heritrix-3.3.0-SNAPSHOT-dist.zip
unzip heritrix-3.3.0-SNAPSHOT-dist.zip
cd heritrix-3.3.0-SNAPSHOT-dist
./bin/heritrix -a admin:admin&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Heritrix is now running at https://localhost:8443 and can be accessed with the username &lt;code&gt;admin&lt;/code&gt; and the password &lt;code&gt;admin&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Next, let us set up a simple crawl using Clojure routines.&lt;/p&gt;

&lt;p&gt;We start with the imports:&lt;/p&gt;

&lt;div class="brush: clojure"&gt;
 &lt;table class="sourcetable"&gt;
  &lt;tbody&gt;
   &lt;tr&gt;
    &lt;td class="linenos"&gt;
     &lt;div class="linenodiv"&gt;
      &lt;pre&gt;1
2
3
4
5
6&lt;/pre&gt;&lt;/div&gt;&lt;/td&gt;
    &lt;td class="code"&gt;
     &lt;div class="source"&gt;
      &lt;pre&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;ns &lt;/span&gt;&lt;span class="nv"&gt;sleipnir.demo&lt;/span&gt;
  &lt;span class="s"&gt;"Dude this is the demo"&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:require&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;net.cgrand.enlive-html&lt;/span&gt; &lt;span class="ss"&gt;:as&lt;/span&gt; &lt;span class="nv"&gt;html&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;sleipnir.handler&lt;/span&gt; &lt;span class="ss"&gt;:as&lt;/span&gt; &lt;span class="nv"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;org.bovinegenius.exploding-fish&lt;/span&gt; &lt;span class="ss"&gt;:as&lt;/span&gt; &lt;span class="nv"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:import&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;java.io&lt;/span&gt; &lt;span class="nv"&gt;StringReader&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;Say we want to walk through reddit&amp;rsquo;s pagination. We use enlive selectors for our extractor code:&lt;/p&gt;

&lt;div class="brush: clojure"&gt;
 &lt;table class="sourcetable"&gt;
  &lt;tbody&gt;
   &lt;tr&gt;
    &lt;td class="linenos"&gt;
     &lt;div class="linenodiv"&gt;
      &lt;pre&gt; 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15&lt;/pre&gt;&lt;/div&gt;&lt;/td&gt;
    &lt;td class="code"&gt;
     &lt;div class="source"&gt;
      &lt;pre&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;defn &lt;/span&gt;&lt;span class="nv"&gt;reddit-pagination-extractor&lt;/span&gt;
  &lt;span class="s"&gt;"Pulls reddit pagination using enlive"&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;url&lt;/span&gt; &lt;span class="nv"&gt;body&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;let &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;resource&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;-&amp;gt; &lt;/span&gt;&lt;span class="nv"&gt;body&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;StringReader.&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nv"&gt;html/html-resource&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nv"&gt;anchors&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;html/select&lt;/span&gt; &lt;span class="nv"&gt;resource&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:span.nextprev&lt;/span&gt; &lt;span class="ss"&gt;:a&lt;/span&gt;&lt;span class="p"&gt;])]&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;
     &lt;span class="nv"&gt;identity&lt;/span&gt;
     &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;
      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;fn &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;an-anchor&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;println &lt;/span&gt;&lt;span class="nv"&gt;an-anchor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;try&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;uri/resolve-uri&lt;/span&gt; &lt;span class="nv"&gt;url&lt;/span&gt;
                              &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;-&amp;gt; &lt;/span&gt;&lt;span class="nv"&gt;an-anchor&lt;/span&gt;
                                  &lt;span class="ss"&gt;:attrs&lt;/span&gt;
                                  &lt;span class="ss"&gt;:href&lt;/span&gt;&lt;span class="p"&gt;))))&lt;/span&gt;
      &lt;span class="nv"&gt;anchors&lt;/span&gt;&lt;span class="p"&gt;))))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;Then, we want to store the submission links from each page in some location:&lt;/p&gt;

&lt;div class="brush: clojure"&gt;
 &lt;table class="sourcetable"&gt;
  &lt;tbody&gt;
   &lt;tr&gt;
    &lt;td class="linenos"&gt;
     &lt;div class="linenodiv"&gt;
      &lt;pre&gt; 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18&lt;/pre&gt;&lt;/div&gt;&lt;/td&gt;
    &lt;td class="code"&gt;
     &lt;div class="source"&gt;
      &lt;pre&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;defn &lt;/span&gt;&lt;span class="nv"&gt;reddit-submission-links-writer&lt;/span&gt;
  &lt;span class="s"&gt;"Gets links to reddit submissions"&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;url&lt;/span&gt; &lt;span class="nv"&gt;body&lt;/span&gt; &lt;span class="nv"&gt;wrtr&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;let &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;resource&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;-&amp;gt; &lt;/span&gt;&lt;span class="nv"&gt;body&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;StringReader.&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nv"&gt;html/html-resource&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nv"&gt;submissions&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;html/select&lt;/span&gt; &lt;span class="nv"&gt;resource&lt;/span&gt;
                                 &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:p.title&lt;/span&gt; &lt;span class="ss"&gt;:a.title&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="nv"&gt;links&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;
               &lt;span class="nv"&gt;identity&lt;/span&gt;
               &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;
                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;fn &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;an-anchor&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;try&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;uri/resolve-uri&lt;/span&gt; &lt;span class="nv"&gt;url&lt;/span&gt;
                                        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;-&amp;gt; &lt;/span&gt;&lt;span class="nv"&gt;an-anchor&lt;/span&gt;
                                            &lt;span class="ss"&gt;:attrs&lt;/span&gt;
                                            &lt;span class="ss"&gt;:href&lt;/span&gt;&lt;span class="p"&gt;))))&lt;/span&gt;
                &lt;span class="nv"&gt;submissions&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;doseq &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;link&lt;/span&gt; &lt;span class="nv"&gt;links&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
     &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;binding &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;*out*&lt;/span&gt; &lt;span class="nv"&gt;wrtr&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
       &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;println &lt;/span&gt;&lt;span class="nv"&gt;link&lt;/span&gt;&lt;span class="p"&gt;)))))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;And then set up and execute the crawl. The config object has a ton of options (I&amp;rsquo;ll flesh the documentation out soon). Several of these options tweak Heritrix&amp;rsquo;s settings.&lt;/p&gt;

&lt;div class="brush: clojure"&gt;
 &lt;table class="sourcetable"&gt;
  &lt;tbody&gt;
   &lt;tr&gt;
    &lt;td class="linenos"&gt;
     &lt;div class="linenodiv"&gt;
      &lt;pre&gt;1
2
3
4
5
6
7
8
9&lt;/pre&gt;&lt;/div&gt;&lt;/td&gt;
    &lt;td class="code"&gt;
     &lt;div class="source"&gt;
      &lt;pre&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;handler/crawl&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;:heritrix-addr&lt;/span&gt; &lt;span class="s"&gt;"https://localhost:8443/engine"&lt;/span&gt;
                &lt;span class="ss"&gt;:job-dir&lt;/span&gt;       &lt;span class="s"&gt;"/Users/shriphani/Documents/reddit-job"&lt;/span&gt;
                &lt;span class="ss"&gt;:username&lt;/span&gt;      &lt;span class="s"&gt;"admin"&lt;/span&gt;
                &lt;span class="ss"&gt;:password&lt;/span&gt;      &lt;span class="s"&gt;"admin"&lt;/span&gt;
                &lt;span class="ss"&gt;:seeds-file&lt;/span&gt;    &lt;span class="s"&gt;"/Users/shriphani/Documents/reddit-job/seeds.txt"&lt;/span&gt;
                &lt;span class="ss"&gt;:contact-url&lt;/span&gt;   &lt;span class="s"&gt;"http://shriphani.com/"&lt;/span&gt;
                &lt;span class="ss"&gt;:out-file&lt;/span&gt;      &lt;span class="s"&gt;"/tmp/bodies.clj"&lt;/span&gt;
                &lt;span class="ss"&gt;:extractor&lt;/span&gt;     &lt;span class="nv"&gt;reddit-pagination-extractor&lt;/span&gt;
                &lt;span class="ss"&gt;:writer&lt;/span&gt;        &lt;span class="nv"&gt;reddit-submission-links-writer&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;In the config above, we specify the address and credentials of the running Heritrix engine, the job directory, the seeds file, the output file, and the extractor and writer routines.&lt;/p&gt;
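
&lt;p&gt;The config points at a seeds file that is not shown in this post. Here is a minimal sketch for creating one, assuming the file uses the usual Heritrix layout of one URL per line and that sleipnir passes it through unchanged. The helper name &lt;code&gt;write-seeds!&lt;/code&gt;, the &lt;code&gt;/tmp&lt;/code&gt; path and the seed URL are hypothetical.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(require '[clojure.java.io :as io]
         '[clojure.string :as string])

;; Assumption: one seed URL per line, as in a stock Heritrix seeds.txt.
(defn write-seeds!
  "Writes each seed URL on its own line, creating parent directories if needed."
  [seeds-path urls]
  (io/make-parents seeds-path)
  (spit seeds-path (str (string/join "\n" urls) "\n")))

;; Hypothetical seed for the pagination walk described in this post.
(write-seeds! "/tmp/reddit-job/seeds.txt" ["https://www.reddit.com/"])&lt;/code&gt;&lt;/pre&gt;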

&lt;p&gt;The result is a Heritrix job that walks through the pagination and dumps the submitted links to &lt;code&gt;/tmp/bodies.clj&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Here&amp;rsquo;s a screengrab of the job:&lt;/p&gt;

&lt;p&gt;&lt;img src="/img/heritrix-sleipnir-demo.png" /&gt;&lt;/p&gt;

&lt;p&gt;And here&amp;rsquo;s a snapshot of the recorded submission links:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;https://www.reddit.com/r/tifu/comments/2ys7wr/tifu_by_attempting_to_clean_the_kitchen/
http://uatoday.tv/politics/ukraine-calls-for-russian-documentary-on-crimea-to-be-sent-to-hague-tribunal-414713.html
https://www.reddit.com/r/Music/comments/2ys88r/check_out_our_free_ep_feathers_by_divide_of/
https://s-media-cache-ak0.pinimg.com/736x/19/f6/18/19f618637135ec676d0dfdcd4d23b542.jpg
http://imgur.com/HqG6Udq
https://www.reddit.com/r/Jokes/comments/2ys8a9/did_you_hear_about_the_nympho_waitress/
https://www.reddit.com/r/AskReddit/comments/2ys8bb/hey_reddit_what_is_a_great_classic_or_family/
http://imgur.com/qTKH4pA
...
...&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Library: &lt;a href="https://github.com/shriphani/sleipnir"&gt;https://github.com/shriphani/sleipnir&lt;/a&gt;&lt;/p&gt;&lt;/html&gt;</description></item>
  <item>
   <title>Modifying The Heritrix Web Crawler</title>
   <link>http://blog.shriphani.com/2014/03/13/modifying-the-heritrix-web-crawler/?utm_source=heritrix&amp;utm_medium=RSS</link>
   <guid>urn:http-blog-shriphani-com:-2014-03-13-modifying-the-heritrix-web-crawler</guid>
   <pubDate>Thu, 13 Mar 2014 05:59:07 UT</pubDate>
   <description>&lt;html&gt;
&lt;p&gt;This is a post I wrote to teach myself about Heritrix and modifying it. There are solid motivations for modifying web-crawlers (say we know how to beat a simple BFS for some specific website). In this post, I will modify a routine that is central to web-crawling - extracting URLs from a webpage.&lt;/p&gt;
&lt;!-- more--&gt;

&lt;p&gt;First, I am going to put together a simple extractor in Heritrix. This extractor uses an XPath expression (a trivial one, for the sake of this example). I use the HtmlCleaner library to parse the supplied HTML and then the XPath classes that ship with Java. I have found that most HTML parsing libraries bundle only partial XPath implementations, and since I typically use more complex queries for my research, I prefer dealing with &lt;code&gt;org.w3c.dom&lt;/code&gt; documents.&lt;/p&gt;

&lt;p&gt;This is what the extractor class looks like. It is super simple:&lt;/p&gt;

&lt;script src="https://gist.github.com/shriphani/9574641.js"&gt;&lt;/script&gt;
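
&lt;p&gt;Independently of the gist, the XPath step on its own can be sketched with nothing but the JDK. This is not the extractor class: it is a small illustration written in Clojure via Java interop, the DOM is built by hand instead of coming from HtmlCleaner, and the names &lt;code&gt;build-doc&lt;/code&gt; and &lt;code&gt;xpath-hrefs&lt;/code&gt; are hypothetical.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(import '[javax.xml.parsers DocumentBuilderFactory]
        '[javax.xml.xpath XPath XPathConstants XPathFactory]
        '[org.w3c.dom Document Element NodeList])

(defn build-doc
  "Builds a tiny DOM by hand: a body element with one anchor per href.
  In a real extractor this document would come from the HTML parser."
  ^Document [hrefs]
  (let [^Document doc (.. (DocumentBuilderFactory/newInstance)
                          newDocumentBuilder
                          newDocument)
        ^Element body (.createElement doc "body")]
    (.appendChild doc body)
    (doseq [^String href hrefs]
      (let [^Element a (.createElement doc "a")]
        (.setAttribute a "href" href)
        (.appendChild body a)))
    doc))

(defn xpath-hrefs
  "Evaluates the XPath //a/@href and returns the attribute values."
  [^Document doc]
  (let [^XPath xpath (.newXPath (XPathFactory/newInstance))
        ^NodeList nodes (.evaluate xpath "//a/@href" doc XPathConstants/NODESET)]
    ;; NodeList is index-based, so walk it by position.
    (mapv (fn [i] (.getNodeValue (.item nodes (int i))))
          (range (.getLength nodes)))))&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because the query runs against a standard &lt;code&gt;org.w3c.dom&lt;/code&gt; document, swapping &lt;code&gt;//a/@href&lt;/code&gt; for a more selective expression needs no other code changes.&lt;/p&gt;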

&lt;p&gt;Now, to see it in action, you need to create a Heritrix job and specify that this is the extractor you want to use. I have a test job that crawls my blog. A Heritrix job contains a configuration file where you can specify the extractors and other details such as the seed links. In this file, I specified the extractor class like so:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;bean id="extractorHtml" class="org.archive.modules.extractor.XPathExtractor"&amp;gt;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;(incidentally the entire file looks like &lt;a href="https://gist.github.com/shriphani/9574658"&gt;this&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;I was subsequently able to process a webpage without too much fuss. In the near future, I plan to describe some of the more interesting things I&amp;rsquo;ve been able to do with Heritrix.&lt;/p&gt;&lt;/html&gt;</description></item></channel></rss>