## Modifying The Heritrix Web Crawler

This is a post I wrote to teach myself about Heritrix and modifying it. There are solid motivations for modifying web-crawlers (say we know how to beat a simple BFS for some specific website). In this post, I will modify a routine that is central to web-crawling - extracting URLs from a webpage.

## Clojure Blekko API

Back when Blekko had an API, I threw this implementation of their API together for a small project I was working on. You can only query blekko with this (the API allows slashtag manipulation and all that which I didn’t bother with).

Anyway, you can do this:

## Usage

With leiningen:

[clj_blekko "0.1.0"]

With maven:

<dependency>
<groupId>clj_blekko</groupId>
<artifactId>clj_blekko</artifactId>
<version>0.1.0</version>
</dependency>

You can just do this:

user> (use 'clj-blekko.core :reload)

user> (clojure.pprint/pprint (:RESULT (run-query "shriphani" api-key :json true :page 2)))
[{:c 1,
:display_url
"kindle.amazon.com/profile/Werner-Vogels/160/followers/65",
:n_group 41,
:short_host "kindle.amazon.com",
:short_host_url "http://kindle.amazon.com/",
:snippet
"1 Followers 0 Books with Public Notes.  Peter van der Reijden.  5 Followers 0 Books with Public Notes.  4 Followers 0 Books with Public Notes.  0 Followers 0 Books with Public Notes.",
:url
"https://kindle.amazon.com/profile/Werner-Vogels/160/followers/65",
:url_title "Amazon Kindle - Werner Vogels - Followers"}
{:c 2,
:display_url "yelp.com/biz/M8P96GmGImU0KOrFMfVCKw",
:n_group 42,
:short_host "yelp.com",
:short_host_url "http://www.yelp.com/",
:snippet
"33 reviews for Blue Nile Restaurant.  Blue Nile Restaurant.  117 Northwestern Ave Ste 2 West Lafayette, IN 47906.  Mon-Sat 11 am - 11 pm.  Sun 12 pm - 10 pm.",
:url "http://www.yelp.com/biz/M8P96GmGImU0KOrFMfVCKw",
:url_title
"<strong>Yelp.com</strong> - Blue Nile Restaurant - West Lafayette, IN"}
{:c 3,
:display_url "yelp.com/biz/blue-ni
le-restaurant-west-lafayette",
:n_group 43,
:short_host "yelp.com",
:short_host_url "http://www.yelp.com/",
:snippet
"34 reviews for Blue Nile Restaurant.  Blue Nile Restaurant.  117 Northwestern Ave Ste 2 West Lafayette, IN 47906.  Mon-Sat 11 am - 11 pm.  Sun 12 pm - 10 pm.",
:url "http://www.yelp.com/biz/blue-nile-restaurant-west-lafayette",
:url_title
"<strong>Yelp.com</strong> - Blue Nile Restaurant - West Lafayette, IN"}
{:c 4,
:n_group 44,
:short_host "amazon.com",
:short_host_url "http://www.amazon.com/",
:snippet
"Frequently Bought Together.  Customers Who Bought This Item Also Bought.  More About the Authors.  Very Bad Poetry Paperback.  Very Bad Poetry (Vintage) and over one million other books are available for Amazon Kindle.",
:url_title
"<strong>Amazon.com</strong> - Very Bad Poetry - Ross Petras, Kathryn Petras - 9780679776222 - Amazo
n.com - Books"}
{:c 5,
:display_url "yelp.com/biz/shaukin-indian-fast-food-west-lafayette",
:n_group 45,
:short_host "yelp.com",
:short_host_url "http://www.yelp.com/",
:snippet
"23 reviews for Shaukin Indian Fast Food.  138 S River Rd West Lafayette, IN 47906.  Tue-Thu 4 pm - 10 pm.  Fri-Sun 12 pm - 10 pm.  Good for Kids.",
:url
"http://www.yelp.com/biz/shaukin-indian-fast-food-west-lafayette",
:url_title
"<strong>Yelp.com</strong> - Shaukin Indian Fast Food - West Lafayette, IN"}
{:c 6,
:display_url
"reddit.com/.../who_here_doesnt_sympathize_with_g20_rioters",
:n_group 46,
:short_host "reddit.com",
:short_host_url "http://www.reddit.com/",
:snippet
"Login or register in seconds.  Limit my search to /r/worldnews.  Use the following search parameters to narrow your results.  Search for &quot;text&quot; in url.  Search for &quot;text&quot; in self post contents.",
:url
:url_title

"<strong>Too Many Requests</strong> - Who here doesn&#39;t sympathize with G20 rioters destroying people&#39;s property - worldnews"}
{:c 7,
:display_url "yelp.ie/biz/shaukin-indian-fast-food-west-lafayette",
:n_group 47,
:short_host "yelp.ie",
:short_host_url "http://www.yelp.ie/",
:snippet
"Recommended Reviews for Shaukin Indian Fast Food.  Cookies help us deliver our services.  By using our services, you agree to our use of cookies.  138 S River Rd West Lafayette, IN 47906.  Good for Children.",
:url
"http://www.yelp.ie/biz/shaukin-indian-fast-food-west-lafayette",
:url_title "Shaukin Indian Fast Food - West Lafayette, IN"}
{:c 8,
:display_url
"en.yelp.be/biz/shaukin-indian-fast-food-west-lafayette",
:n_group 48,
:short_host "en.yelp.be",
:short_host_url "http://en.yelp.be/",
:snippet
"25 reviews for Shaukin Indian Fast Food.  138 S River Rd West Lafayette, IN 47906.  Good for Children.  Accepts Credit Cards.  Good for Groups.",
:url "http://en.yelp.be/biz/shaukin-ind
ian-fast-food-west-lafayette",
:url_title "Shaukin Indian Fast Food - West Lafayette, IN"}
{:c 9,
:display_url "quora.com/Colin-Ho",
:n_group 49,
:short_host "quora.com",
:short_host_url "http://www.quora.com/",
:snippet
:url "http://www.quora.com/Colin-Ho",
:url_title "Colin Ho - <strong>Quora</strong>"}
{:c 10,
:display_url
"quora.com/...change-the-world-the-most-within-the-next-25-years",
:n_group 50,
:short_host "quora.com",
:short_host_url "http://www.quora.com/",
:snippet
:url
"http://www.quora.com/Which-technological-innovatio
n-will-change-the-world-the-most-within-the-next-25-years",
:url_title
"<strong>Quora</strong> - Which technological innovation will change the world the most within the next 25 years - Quora"}
{:display_url "meetup.com/Clojure-PGH",
:short_host "meetup.com",
:c 11,
:url_title
"<strong>Meetup</strong> - Pittsburgh Clojure Users Group (Pittsburgh, PA) - Meetup",
:n_group 51,
:doc_date "Mar 2010",
:url "http://www.meetup.com/Clojure-PGH/",
:short_host_url "http://www.meetup.com/",
:snippet
"Welcome old lispers and new schemers.  Come to our next event to meet other programmers interested in the latest secret alien technology.  We are building a community of people who want to learn from and teach others about Clojure.  Talking with printed notes is encouraged, powerpoints are forbidden.",
:doc_date_iso "2010-03-04 00:00:00"}
{:c 12,
:display_url
"quora.com/.../How-do-you-ensure-that-TAs-for-introductory-CS...",
:n_group 52,
:short_host "quora.com",
:short_host_url "http://w
ww.quora.com/",
:snippet
:url
"http://www.quora.com/Computer-Science-Education/How-do-you-ensure-that-TAs-for-introductory-CS-classes-teach-at-a-high-quality",
:url_title
"<strong>Quora</strong> - Computer Science Education - How do you ensure that TAs for introductory CS classes teach at ... "}
{:c 13,
:display_url
:n_group 53,
:short_host "quora.com",
:short_host_url "http://www.quora.com/",
:snippet
"If I were to start learning Scala and the functional programming paradigm (coming from imperative languages), am I getting early or late to the party.  I&#39;m a functional learner myself.",
:url
:url_title
"<strong>Quora</strong>
- Is it already too late to get on the wave of functional programming and Scala"}
{:c 14,
:n_group 54,
:short_host "python.org",
:short_host_url "http://www.python.org/",
:snippet
"Contributors to the Python Documentation.  About these documents.  These documents are generated from reStructuredText sources by Sphinx, a document processor specifically written for the Python documentation.",
:url_title
"<strong>Python Programming Language</strong> - About these documents — Python v3.0.1 documentation"}]

Anyway, thought it might be useful (even though blekko’s gone and nuked free access to their API).

## clojure scraping overview

Earlier this week I gave a talk on scraping with clojure - primarily using clj-xpath and enlive at the Pittsburgh clojure meetup group. Slides and code are linked to below.

## All It Took Was An AHA!

I can clearly recall why I became a computer scientist. I was sitting in a class and we were discussing how cons was implemented. And then I saw this definition:

 1 2 3 4 5 6 7 8 9 (define (cons a b) (lambda (x) (if (= x 1) a b))) (define (first l) (l 1)) (define (rest l) (l 2)) 

The lambda calculus and the material in the little schemer kept me in the field (computer-science i.e.) and assured me that there would never be a dearth of aha! moments in my education.

Good educators can deliver such aha! moments in every single lecture. A good textbook can do it several times each chapter.

I have since tried to find material that delivers such aha! moments.

Hopefully, I will encounter them for the rest of my life in whatever I do.

## The Sistine Chapel

The Vatican has a fantastic interactive view of Michelangelo’s work in the Sistine Chapel.

## Robust Principal Component Pursuit - Background Matrix Recovery

I recently spent some time working on a simple linear algebra problem - decompose a matrix $M$ into a low-rank component $L$ and a sparse component $S$. The algorithm I used was very trivial to implement (and parallelize using map-reduce).

In this post, I will implement this very simple algorithm, explain the objective function and demonstrate its (amazing) effectiveness on a surveillance-camera dataset.

## Probabilistic Counting

I recently got my hands on the common-crawl URL dataset from here and wanted to compute some stats like the number of unique domains contained in the corpus. This was the perfect opportunity to whip up a script that implemented Durande et. al’s log-log paper.

## Tree Edit Distance Enlive Version

In my last post, I presented a code-dump that computed a restricted version of the tree edit distance algorithm. I was able to achieve a decent speed-up using enlive. Here’s a code-dump:

## Tree Edit Distance in Clojure

I had to throw together an implementation of tree-edit distance for clustering web-pages based on their structure. It performs reasonably quickly. The algorithm itself. The repo is here.

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 (ns tree-edit-distance.core (:use [clojure.pprint]) (:import (org.htmlcleaner HtmlCleaner DomSerializer CleanerProperties) (org.w3c.dom Document))) (defn init "Perform the correct initialization" [m n c1 c2 del-cost ins-cost] (let [M (make-array Integer/TYPE (inc m) (inc n))] (do (doseq [i (range (inc m)) j (range (inc n))] (aset M i j (int (+ (* del-cost c1) (* ins-cost c2))))) M))) (defn num-children "Expects a html tree" [a-tree] (if(.hasChildNodes a-tree) (.getLength (.getChildNodes a-tree)) 0)) (defn tree-children "Return level 1 children" [a-tree] (let [n (num-children a-tree) cs (.getChildNodes a-tree)] (map #(.item cs %) (range n)))) (defn tree-descendants [a-tree] (if (.hasChildNodes a-tree) (concat (tree-children a-tree) (flatten (map tree-descendants (tree-children a-tree)))) [])) (declare rtdm-edit-distance) (defn invert-cost [t1 t2 del-cost ins-cost sub-cost] (let [t1-desc (tree-descendants t1) t2-desc (tree-descendants t2)] (- (+ (* del-cost (count t1-desc)) (* ins-cost (count t2-desc))) (rtdm-edit-distance t1 t2 del-cost ins-cost sub-cost)))) (defn rtdm-edit-distance "The RTDM algorithm for computing edit-distance. The trees are assumed to be org.w3c.dom.Documents" [tree-1 tree-2 del-cost ins-cost sub-cost] (let [m (num-children tree-1) n (num-children tree-2) t1-children (tree-children tree-1) t2-children (tree-children tree-2) t1-desc (tree-descendants tree-1) t2-desc (tree-descendants tree-2) M (init m n (count t1-desc) (count t2-desc) del-cost ins-cost)] (doseq [i (range m) j (range n)] (let [c-i (nth t1-children i) c-j (nth t2-children j) c-i-desc (tree-descendants c-i) c-j-desc (tree-descendants c-j) del (aget M i (inc j)) ins (aget M (inc i) j) sub-i (- (aget M i j) del-cost ins-cost) sub (if (.isEqualNode c-i c-j) (- sub-i (* ins-cost (count c-j-desc)) (* del-cost (count c-i-desc))) (cond (or (not (.hasChildNodes c-i)) (not (.hasChildNodes c-j))) (+ sub-i sub-cost) (and (= (.getNodeName c-i) (.getNodeName c-j)) (try (= (.getNodeValue (.getNamedItem (.getAttributes c-i) "id")) (.getNodeValue (.getNamedItem (.getAttributes c-j) "id"))) (catch Exception e true)) (try (= (.getNodeValue (.getNamedItem (.getAttributes c-i) "class")) (.getNodeValue (.getNamedItem (.getAttributes c-j) "class"))) (catch Exception e true))) (- sub-i (invert-cost c-i c-j del-cost ins-cost sub-cost)) :else (+ sub-i sub-cost)))] (aset M (inc i) (inc j) (int (min del ins sub))))) (aget M m n))) (defn rtdm-edit-distance-sim [tree-1 tree-2 del-cost ins-cost sub-cost] (let [t1-desc (tree-descendants tree-1) t2-desc (tree-descendants tree-2)] (- 1 (/ (rtdm-edit-distance tree-1 tree-2 del-cost ins-cost sub-cost) (+ (* (+ (count t1-desc) 1) del-cost) (* (+ (count t2-desc) 1) sub-cost)))))) (defn get-xml-tree-body "Downloads a webpage and converts it to an org.w3.dom.Document" [page-src] (let [cleaner (new HtmlCleaner) props (.getProperties cleaner) cleaner-props (new CleanerProperties) dom-serializer (new DomSerializer cleaner-props) tag-node (.clean cleaner page-src)] (.createDOM dom-serializer tag-node))) (defn rtdm-edit-distance-html [pg1 pg2 del-cost ins-cost sub-cost] (rtdm-edit-distance-sim (get-xml-tree-body pg1) (get-xml-tree-body pg2) del-cost ins-cost sub-cost)) 

## A Comment on Dimension-Estimation

I saw this neat comment in a paper I was recently reading. If you have all i.i.d features and you want to estimate its dimension using Grassberger-Procaccia (which estimates dimension using a distance-based metric) or want to classify using a k-NN classifier, it is bad if the data points are mostly pairwise equidistant (for instance, a correlation integral plot will look like a step function and thus will be useless; a k-NN classifier will break because the test point ends up equidistant from all the existing points).

There is a trivial argument using the Hoeffding bound in Chris Burges’ paper that suggests that if the features are all i.i.d, a majority of pairwise distances will end up clustered tightly around a mean which means that k-NN or Grassberger-Procaccia won’t work well. I am going to repeat this argument here so I can remember it for later:

Our vectors are of dimension $d$ and the components are $\pm1$. Assuming all the components are $iid$, the Hoeffding bound gives us:

$$P(||| x_{1} - x_{2} ||^{2} – 2d| > d\epsilon) = P(| x_{1} \cdot x_{2} | > d\epsilon/2) \le 2exp(-\frac{d\epsilon^2}{8})$$

and this shows us that most pairwise distances will end up clustered very tightly around a mean and this means that a majority of pairs of points in the dataset will end up equidistant and thus a $k-NN$ classifier will fail.

This also means that the correlation integral is a good way to determine if a k-NN classifier will work well. If the plot resembles a spike, the distance function needs to change.

The correlation-integral is an immensely powerful tool and here’s an implementation