Augmenting enlive

In manipulating HTML documents for features, I find myself needing to use some operations all the time - removing script tags, comments and the like. This feature-set is available in HtmlCleaner and I thus merged the two libraries to produce enlive-helper.

Now you can do:

1
2
3
(html-resource-steroids 
 (java.io.StringReader. "<html><body><a>hi</a></body></html>") 
 :prune-tags "a")

And as a result the a tag is not picked up:

1
2
3
4
5
6
7
({:tag :html,
  :attrs nil,
  :content
  ("\n"
   {:tag :head, :attrs nil, :content nil}
   "\n"
   {:tag :body, :attrs nil, :content nil})})

The options you can pass mirror those of HtmlCleaner. Full docs available in this github repo.

Also, the code is something I threw together from my research so it is released under Matt Might’s CRAPL license.


Consistent Hashing in Clojure

I wrote this post to teach myself consistent hashing - a simple hash family that Akamai’s founders came up with. This was originally done to prepare for a talk in my grad algorithms class (I made a horlicks of the talk but whatever). I am going to provide intuition, analysis and a clojure implementation.


Clojure/Java String trim

Java’s string trim routine tests for whitespace using Character.isWhitespace and so does Clojure’s clojure.string/trim.

While processing a dataset off the web containing unicode space characters, the trim routine failed to do anything useful. Luckily, a StackOverflow thread suggested using a routine from Google’s guava library. So in Clojure, you can do this:

1
(.trimFrom CharMatcher/WHITESPACE %)

and this will do the job.


Clojure Blekko API

Back when Blekko had an API, I threw this implementation of their API together for a small project I was working on. You can only query blekko with this (the API allows slashtag manipulation and all that which I didn’t bother with).

Anyway, you can do this:

Usage

With leiningen:

[clj_blekko "0.1.0"]

With maven:

<dependency>
  <groupId>clj_blekko</groupId>
  <artifactId>clj_blekko</artifactId>
  <version>0.1.0</version>
</dependency>

You can just do this:

user> (use 'clj-blekko.core :reload)

user> (clojure.pprint/pprint (:RESULT (run-query "shriphani" api-key :json true :page 2)))
[{:c 1,
  :display_url
  "kindle.amazon.com/profile/Werner-Vogels/160/followers/65",
  :n_group 41,
  :short_host "kindle.amazon.com",
  :short_host_url "http://kindle.amazon.com/",
  :snippet
  "1 Followers 0 Books with Public Notes.  Peter van der Reijden.  5 Followers 0 Books with Public Notes.  4 Followers 0 Books with Public Notes.  0 Followers 0 Books with Public Notes.",
  :url
  "https://kindle.amazon.com/profile/Werner-Vogels/160/followers/65",
  :url_title "Amazon Kindle - Werner Vogels - Followers"}
 {:c 2,
  :display_url "yelp.com/biz/M8P96GmGImU0KOrFMfVCKw",
  :n_group 42,
  :short_host "yelp.com",
  :short_host_url "http://www.yelp.com/",
  :snippet
  "33 reviews for Blue Nile Restaurant.  Blue Nile Restaurant.  117 Northwestern Ave Ste 2 West Lafayette, IN 47906.  Mon-Sat 11 am - 11 pm.  Sun 12 pm - 10 pm.",
  :url "http://www.yelp.com/biz/M8P96GmGImU0KOrFMfVCKw",
  :url_title
  "<strong>Yelp.com</strong> - Blue Nile Restaurant - West Lafayette, IN"}
 {:c 3,
  :display_url "yelp.com/biz/blue-ni
le-restaurant-west-lafayette",
  :n_group 43,
  :short_host "yelp.com",
  :short_host_url "http://www.yelp.com/",
  :snippet
  "34 reviews for Blue Nile Restaurant.  Blue Nile Restaurant.  117 Northwestern Ave Ste 2 West Lafayette, IN 47906.  Mon-Sat 11 am - 11 pm.  Sun 12 pm - 10 pm.",
  :url "http://www.yelp.com/biz/blue-nile-restaurant-west-lafayette",
  :url_title
  "<strong>Yelp.com</strong> - Blue Nile Restaurant - West Lafayette, IN"}
 {:c 4,
  :display_url "amazon.com/gp/product/0679776222?link_code=as3",
  :n_group 44,
  :short_host "amazon.com",
  :short_host_url "http://www.amazon.com/",
  :snippet
  "Frequently Bought Together.  Customers Who Bought This Item Also Bought.  More About the Authors.  Very Bad Poetry Paperback.  Very Bad Poetry (Vintage) and over one million other books are available for Amazon Kindle.",
  :url "http://www.amazon.com/gp/product/0679776222?link_code=as3",
  :url_title
  "<strong>Amazon.com</strong> - Very Bad Poetry - Ross Petras, Kathryn Petras - 9780679776222 - Amazo
n.com - Books"}
 {:c 5,
  :display_url "yelp.com/biz/shaukin-indian-fast-food-west-lafayette",
  :n_group 45,
  :short_host "yelp.com",
  :short_host_url "http://www.yelp.com/",
  :snippet
  "23 reviews for Shaukin Indian Fast Food.  138 S River Rd West Lafayette, IN 47906.  Tue-Thu 4 pm - 10 pm.  Fri-Sun 12 pm - 10 pm.  Good for Kids.",
  :url
  "http://www.yelp.com/biz/shaukin-indian-fast-food-west-lafayette",
  :url_title
  "<strong>Yelp.com</strong> - Shaukin Indian Fast Food - West Lafayette, IN"}
 {:c 6,
  :display_url
  "reddit.com/.../who_here_doesnt_sympathize_with_g20_rioters",
  :n_group 46,
  :short_host "reddit.com",
  :short_host_url "http://www.reddit.com/",
  :snippet
  "Login or register in seconds.  Limit my search to /r/worldnews.  Use the following search parameters to narrow your results.  Search for &quot;text&quot; in url.  Search for &quot;text&quot; in self post contents.",
  :url
  "http://www.reddit.com/r/worldnews/comments/cjhse/who_here_doesnt_sympathize_with_g20_rioters/",
  :url_title

  "<strong>Too Many Requests</strong> - Who here doesn&#39;t sympathize with G20 rioters destroying people&#39;s property - worldnews"}
 {:c 7,
  :display_url "yelp.ie/biz/shaukin-indian-fast-food-west-lafayette",
  :n_group 47,
  :short_host "yelp.ie",
  :short_host_url "http://www.yelp.ie/",
  :snippet
  "Recommended Reviews for Shaukin Indian Fast Food.  Cookies help us deliver our services.  By using our services, you agree to our use of cookies.  138 S River Rd West Lafayette, IN 47906.  Good for Children.",
  :url
  "http://www.yelp.ie/biz/shaukin-indian-fast-food-west-lafayette",
  :url_title "Shaukin Indian Fast Food - West Lafayette, IN"}
 {:c 8,
  :display_url
  "en.yelp.be/biz/shaukin-indian-fast-food-west-lafayette",
  :n_group 48,
  :short_host "en.yelp.be",
  :short_host_url "http://en.yelp.be/",
  :snippet
  "25 reviews for Shaukin Indian Fast Food.  138 S River Rd West Lafayette, IN 47906.  Good for Children.  Accepts Credit Cards.  Good for Groups.",
  :url "http://en.yelp.be/biz/shaukin-ind
ian-fast-food-west-lafayette",
  :url_title "Shaukin Indian Fast Food - West Lafayette, IN"}
 {:c 9,
  :display_url "quora.com/Colin-Ho",
  :n_group 49,
  :short_host "quora.com",
  :short_host_url "http://www.quora.com/",
  :snippet
  "You must sign in to read Quora past the first answer.  Login to Quora.  Complete your account on Quora.  There are some updates to this page that haven&#39;t been applied yet because you&#39;ve entered some data into a form.  Refresh this page to receive new updates.",
  :url "http://www.quora.com/Colin-Ho",
  :url_title "Colin Ho - <strong>Quora</strong>"}
 {:c 10,
  :display_url
  "quora.com/...change-the-world-the-most-within-the-next-25-years",
  :n_group 50,
  :short_host "quora.com",
  :short_host_url "http://www.quora.com/",
  :snippet
  "You must sign in to read past the first answer.  Complete Your Profile.  Login to Quora.  You must sign in to read all of Quora.  You must be signed in to read this answer.",
  :url
  "http://www.quora.com/Which-technological-innovatio
n-will-change-the-world-the-most-within-the-next-25-years",
  :url_title
  "<strong>Quora</strong> - Which technological innovation will change the world the most within the next 25 years - Quora"}
 {:display_url "meetup.com/Clojure-PGH",
  :short_host "meetup.com",
  :c 11,
  :url_title
  "<strong>Meetup</strong> - Pittsburgh Clojure Users Group (Pittsburgh, PA) - Meetup",
  :n_group 51,
  :doc_date "Mar 2010",
  :url "http://www.meetup.com/Clojure-PGH/",
  :short_host_url "http://www.meetup.com/",
  :snippet
  "Welcome old lispers and new schemers.  Come to our next event to meet other programmers interested in the latest secret alien technology.  We are building a community of people who want to learn from and teach others about Clojure.  Talking with printed notes is encouraged, powerpoints are forbidden.",
  :doc_date_iso "2010-03-04 00:00:00"}
 {:c 12,
  :display_url
  "quora.com/.../How-do-you-ensure-that-TAs-for-introductory-CS...",
  :n_group 52,
  :short_host "quora.com",
  :short_host_url "http://w
ww.quora.com/",
  :snippet
  "You must sign in to read past the first answer.  Complete Your Profile.  Login to Quora.  You must sign in to read all of Quora.  You must be signed in to read this answer.",
  :url
  "http://www.quora.com/Computer-Science-Education/How-do-you-ensure-that-TAs-for-introductory-CS-classes-teach-at-a-high-quality",
  :url_title
  "<strong>Quora</strong> - Computer Science Education - How do you ensure that TAs for introductory CS classes teach at ... "}
 {:c 13,
  :display_url
  "quora.com/.../Is-it-already-too-late-to-get-on-the-wave-of-f...",
  :n_group 53,
  :short_host "quora.com",
  :short_host_url "http://www.quora.com/",
  :snippet
  "If I were to start learning Scala and the functional programming paradigm (coming from imperative languages), am I getting early or late to the party.  I&#39;m a functional learner myself.",
  :url
  "http://www.quora.com/Scala/Is-it-already-too-late-to-get-on-the-wave-of-functional-programming-and-Scala",
  :url_title
  "<strong>Quora</strong> 
- Is it already too late to get on the wave of functional programming and Scala"}
 {:c 14,
  :display_url "python.org/doc/3.0.1/about.html",
  :n_group 54,
  :short_host "python.org",
  :short_host_url "http://www.python.org/",
  :snippet
  "Contributors to the Python Documentation.  About these documents.  These documents are generated from reStructuredText sources by Sphinx, a document processor specifically written for the Python documentation.",
  :url "http://www.python.org/doc/3.0.1/about.html",
  :url_title
  "<strong>Python Programming Language</strong> - About these documents — Python v3.0.1 documentation"}]

Anyway, thought it might be useful (even though blekko’s gone and nuked free access to their API).


Tree Edit Distance in Clojure

I had to throw together an implementation of tree-edit distance for clustering web-pages based on their structure. It performs reasonably quickly. The algorithm itself. The repo is here.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
(ns tree-edit-distance.core
  (:use [clojure.pprint])
  (:import (org.htmlcleaner HtmlCleaner DomSerializer CleanerProperties)
           (org.w3c.dom Document)))

(defn init
  "Perform the correct initialization"
  [m n c1 c2 del-cost ins-cost]
  (let [M (make-array Integer/TYPE (inc m) (inc n))]
    (do
      (doseq [i (range (inc m))
              j (range (inc n))]
        (aset M i j (int
                     (+ (* del-cost c1)
                        (* ins-cost c2)))))
      M)))

(defn num-children
  "Expects a html tree"
  [a-tree]
  (if(.hasChildNodes a-tree)
    (.getLength (.getChildNodes a-tree)) 0))

(defn tree-children
  "Return level 1 children"
  [a-tree]
  (let [n  (num-children a-tree)
        cs (.getChildNodes a-tree)]
    (map
     #(.item cs %)
     (range n))))

(defn tree-descendants
  [a-tree]
  (if (.hasChildNodes a-tree)
    (concat (tree-children a-tree)
            (flatten (map tree-descendants (tree-children a-tree))))
    []))

(declare rtdm-edit-distance)

(defn invert-cost
  [t1 t2 del-cost ins-cost sub-cost]
  (let [t1-desc (tree-descendants t1)
        t2-desc (tree-descendants t2)]
    (- (+ (* del-cost (count t1-desc))
          (* ins-cost (count t2-desc)))
       (rtdm-edit-distance t1 t2 del-cost ins-cost sub-cost))))

(defn rtdm-edit-distance
  "The RTDM algorithm for computing edit-distance.
   The trees are assumed to be org.w3c.dom.Documents"
  [tree-1 tree-2 del-cost ins-cost sub-cost]

  (let [m (num-children tree-1)
        n (num-children tree-2)

        t1-children (tree-children tree-1)
        t2-children (tree-children tree-2)

        t1-desc (tree-descendants tree-1)
        t2-desc (tree-descendants tree-2)

        M (init m n (count t1-desc) (count t2-desc) del-cost ins-cost)]

    (doseq [i (range m)
            j (range n)]
      (let [c-i (nth t1-children i)
            c-j (nth t2-children j)

            c-i-desc (tree-descendants c-i)
            c-j-desc (tree-descendants c-j)

            del (aget M i (inc j))
            ins (aget M (inc i) j)

            sub-i (- (aget M i j)
                     del-cost
                     ins-cost)

            sub (if (.isEqualNode c-i c-j)
                  (- sub-i
                     (* ins-cost (count c-j-desc))
                     (* del-cost (count c-i-desc)))
                  (cond
                   (or (not (.hasChildNodes c-i))
                       (not (.hasChildNodes c-j)))
                   (+ sub-i sub-cost)

                   (and (= (.getNodeName c-i) (.getNodeName c-j))
                        (try
                          (= (.getNodeValue (.getNamedItem (.getAttributes c-i) "id"))
                             (.getNodeValue (.getNamedItem (.getAttributes c-j) "id")))
                          (catch Exception e true))
                        (try
                          (= (.getNodeValue (.getNamedItem (.getAttributes c-i) "class"))
                             (.getNodeValue (.getNamedItem (.getAttributes c-j) "class")))
                          (catch Exception e true)))
                   (- sub-i (invert-cost c-i c-j del-cost ins-cost sub-cost))

                   :else
                   (+ sub-i sub-cost)))]
        (aset M (inc i) (inc j) (int (min del ins sub)))))
    (aget M m n)))

(defn rtdm-edit-distance-sim
  [tree-1 tree-2 del-cost ins-cost sub-cost]
  (let [t1-desc (tree-descendants tree-1)
        t2-desc (tree-descendants tree-2)]
    (- 1
       (/ (rtdm-edit-distance tree-1 tree-2 del-cost ins-cost sub-cost)
          (+ (* (+ (count t1-desc) 1) del-cost)
             (* (+ (count t2-desc) 1) sub-cost))))))

(defn get-xml-tree-body
  "Downloads a webpage and converts it to an org.w3.dom.Document"
  [page-src]

  (let [cleaner        (new HtmlCleaner)
        props          (.getProperties cleaner)
        cleaner-props  (new CleanerProperties)
        dom-serializer (new DomSerializer cleaner-props)
        tag-node       (.clean cleaner page-src)]

    (.createDOM dom-serializer tag-node)))

(defn rtdm-edit-distance-html
  [pg1 pg2 del-cost ins-cost sub-cost]
  (rtdm-edit-distance-sim
   (get-xml-tree-body pg1)
   (get-xml-tree-body pg2)
   del-cost
   ins-cost
   sub-cost))

Accessing Your Kindle Highlights

In 2010, I purchased my first Kindle and since then apart from GEB [1], I haven’t bothered with physical copies. The Kindle store satisfies most of my needs (I find situations where the paperback costs less than the digital copy and refuse to buy the book on principle).

The books can be read on any platform (OS X, iOS for iPad and iPhone in my case and I do remember a rather unpleasant Kindle app on WP7)

One of the benefits of a digital book is that it should be straightforward for me to collect a list of highlights I’ve made about the book. Amazon (in their infinite wisdom) have not provided an API in the 3 or so years I’ve used the Kindle ecosystem and manually transcribing the quotes is not something I am interested in doing. Scraping remains the only alternative. I decided to use clojure for this task.



Per Intellectum, Vis
(c) Shriphani Palakodety 2013-2016