The TW-IDF Model

I found this recent paper on adhoc-IR from CIKM 2013. I haven’t worked in this area but the results seem promising. Essentially, instead of TF, you store an indegree count. The graph in question is the term co-occurrence graph within a window with the direction indicating word-order. The paper won hon. mention at the CIKM so it is clearly very cool.

Link to paper


Modifying The Heritrix Web Crawler

This is a post I wrote to teach myself about Heritrix and modifying it. There are solid motivations for modifying web-crawlers (say we know how to beat a simple BFS for some specific website). In this post, I will modify a routine that is central to web-crawling - extracting URLs from a webpage.


Clojure Blekko API

Back when Blekko had an API, I threw this implementation of their API together for a small project I was working on. You can only query blekko with this (the API allows slashtag manipulation and all that which I didn’t bother with).

Anyway, you can do this:

Usage

With leiningen:

[clj_blekko "0.1.0"]

With maven:

<dependency>
  <groupId>clj_blekko</groupId>
  <artifactId>clj_blekko</artifactId>
  <version>0.1.0</version>
</dependency>

You can just do this:

user> (use 'clj-blekko.core :reload)

user> (clojure.pprint/pprint (:RESULT (run-query "shriphani" api-key :json true :page 2)))
[{:c 1,
  :display_url
  "kindle.amazon.com/profile/Werner-Vogels/160/followers/65",
  :n_group 41,
  :short_host "kindle.amazon.com",
  :short_host_url "http://kindle.amazon.com/",
  :snippet
  "1 Followers 0 Books with Public Notes.  Peter van der Reijden.  5 Followers 0 Books with Public Notes.  4 Followers 0 Books with Public Notes.  0 Followers 0 Books with Public Notes.",
  :url
  "https://kindle.amazon.com/profile/Werner-Vogels/160/followers/65",
  :url_title "Amazon Kindle - Werner Vogels - Followers"}
 {:c 2,
  :display_url "yelp.com/biz/M8P96GmGImU0KOrFMfVCKw",
  :n_group 42,
  :short_host "yelp.com",
  :short_host_url "http://www.yelp.com/",
  :snippet
  "33 reviews for Blue Nile Restaurant.  Blue Nile Restaurant.  117 Northwestern Ave Ste 2 West Lafayette, IN 47906.  Mon-Sat 11 am - 11 pm.  Sun 12 pm - 10 pm.",
  :url "http://www.yelp.com/biz/M8P96GmGImU0KOrFMfVCKw",
  :url_title
  "<strong>Yelp.com</strong> - Blue Nile Restaurant - West Lafayette, IN"}
 {:c 3,
  :display_url "yelp.com/biz/blue-ni
le-restaurant-west-lafayette",
  :n_group 43,
  :short_host "yelp.com",
  :short_host_url "http://www.yelp.com/",
  :snippet
  "34 reviews for Blue Nile Restaurant.  Blue Nile Restaurant.  117 Northwestern Ave Ste 2 West Lafayette, IN 47906.  Mon-Sat 11 am - 11 pm.  Sun 12 pm - 10 pm.",
  :url "http://www.yelp.com/biz/blue-nile-restaurant-west-lafayette",
  :url_title
  "<strong>Yelp.com</strong> - Blue Nile Restaurant - West Lafayette, IN"}
 {:c 4,
  :display_url "amazon.com/gp/product/0679776222?link_code=as3",
  :n_group 44,
  :short_host "amazon.com",
  :short_host_url "http://www.amazon.com/",
  :snippet
  "Frequently Bought Together.  Customers Who Bought This Item Also Bought.  More About the Authors.  Very Bad Poetry Paperback.  Very Bad Poetry (Vintage) and over one million other books are available for Amazon Kindle.",
  :url "http://www.amazon.com/gp/product/0679776222?link_code=as3",
  :url_title
  "<strong>Amazon.com</strong> - Very Bad Poetry - Ross Petras, Kathryn Petras - 9780679776222 - Amazo
n.com - Books"}
 {:c 5,
  :display_url "yelp.com/biz/shaukin-indian-fast-food-west-lafayette",
  :n_group 45,
  :short_host "yelp.com",
  :short_host_url "http://www.yelp.com/",
  :snippet
  "23 reviews for Shaukin Indian Fast Food.  138 S River Rd West Lafayette, IN 47906.  Tue-Thu 4 pm - 10 pm.  Fri-Sun 12 pm - 10 pm.  Good for Kids.",
  :url
  "http://www.yelp.com/biz/shaukin-indian-fast-food-west-lafayette",
  :url_title
  "<strong>Yelp.com</strong> - Shaukin Indian Fast Food - West Lafayette, IN"}
 {:c 6,
  :display_url
  "reddit.com/.../who_here_doesnt_sympathize_with_g20_rioters",
  :n_group 46,
  :short_host "reddit.com",
  :short_host_url "http://www.reddit.com/",
  :snippet
  "Login or register in seconds.  Limit my search to /r/worldnews.  Use the following search parameters to narrow your results.  Search for &quot;text&quot; in url.  Search for &quot;text&quot; in self post contents.",
  :url
  "http://www.reddit.com/r/worldnews/comments/cjhse/who_here_doesnt_sympathize_with_g20_rioters/",
  :url_title

  "<strong>Too Many Requests</strong> - Who here doesn&#39;t sympathize with G20 rioters destroying people&#39;s property - worldnews"}
 {:c 7,
  :display_url "yelp.ie/biz/shaukin-indian-fast-food-west-lafayette",
  :n_group 47,
  :short_host "yelp.ie",
  :short_host_url "http://www.yelp.ie/",
  :snippet
  "Recommended Reviews for Shaukin Indian Fast Food.  Cookies help us deliver our services.  By using our services, you agree to our use of cookies.  138 S River Rd West Lafayette, IN 47906.  Good for Children.",
  :url
  "http://www.yelp.ie/biz/shaukin-indian-fast-food-west-lafayette",
  :url_title "Shaukin Indian Fast Food - West Lafayette, IN"}
 {:c 8,
  :display_url
  "en.yelp.be/biz/shaukin-indian-fast-food-west-lafayette",
  :n_group 48,
  :short_host "en.yelp.be",
  :short_host_url "http://en.yelp.be/",
  :snippet
  "25 reviews for Shaukin Indian Fast Food.  138 S River Rd West Lafayette, IN 47906.  Good for Children.  Accepts Credit Cards.  Good for Groups.",
  :url "http://en.yelp.be/biz/shaukin-ind
ian-fast-food-west-lafayette",
  :url_title "Shaukin Indian Fast Food - West Lafayette, IN"}
 {:c 9,
  :display_url "quora.com/Colin-Ho",
  :n_group 49,
  :short_host "quora.com",
  :short_host_url "http://www.quora.com/",
  :snippet
  "You must sign in to read Quora past the first answer.  Login to Quora.  Complete your account on Quora.  There are some updates to this page that haven&#39;t been applied yet because you&#39;ve entered some data into a form.  Refresh this page to receive new updates.",
  :url "http://www.quora.com/Colin-Ho",
  :url_title "Colin Ho - <strong>Quora</strong>"}
 {:c 10,
  :display_url
  "quora.com/...change-the-world-the-most-within-the-next-25-years",
  :n_group 50,
  :short_host "quora.com",
  :short_host_url "http://www.quora.com/",
  :snippet
  "You must sign in to read past the first answer.  Complete Your Profile.  Login to Quora.  You must sign in to read all of Quora.  You must be signed in to read this answer.",
  :url
  "http://www.quora.com/Which-technological-innovatio
n-will-change-the-world-the-most-within-the-next-25-years",
  :url_title
  "<strong>Quora</strong> - Which technological innovation will change the world the most within the next 25 years - Quora"}
 {:display_url "meetup.com/Clojure-PGH",
  :short_host "meetup.com",
  :c 11,
  :url_title
  "<strong>Meetup</strong> - Pittsburgh Clojure Users Group (Pittsburgh, PA) - Meetup",
  :n_group 51,
  :doc_date "Mar 2010",
  :url "http://www.meetup.com/Clojure-PGH/",
  :short_host_url "http://www.meetup.com/",
  :snippet
  "Welcome old lispers and new schemers.  Come to our next event to meet other programmers interested in the latest secret alien technology.  We are building a community of people who want to learn from and teach others about Clojure.  Talking with printed notes is encouraged, powerpoints are forbidden.",
  :doc_date_iso "2010-03-04 00:00:00"}
 {:c 12,
  :display_url
  "quora.com/.../How-do-you-ensure-that-TAs-for-introductory-CS...",
  :n_group 52,
  :short_host "quora.com",
  :short_host_url "http://w
ww.quora.com/",
  :snippet
  "You must sign in to read past the first answer.  Complete Your Profile.  Login to Quora.  You must sign in to read all of Quora.  You must be signed in to read this answer.",
  :url
  "http://www.quora.com/Computer-Science-Education/How-do-you-ensure-that-TAs-for-introductory-CS-classes-teach-at-a-high-quality",
  :url_title
  "<strong>Quora</strong> - Computer Science Education - How do you ensure that TAs for introductory CS classes teach at ... "}
 {:c 13,
  :display_url
  "quora.com/.../Is-it-already-too-late-to-get-on-the-wave-of-f...",
  :n_group 53,
  :short_host "quora.com",
  :short_host_url "http://www.quora.com/",
  :snippet
  "If I were to start learning Scala and the functional programming paradigm (coming from imperative languages), am I getting early or late to the party.  I&#39;m a functional learner myself.",
  :url
  "http://www.quora.com/Scala/Is-it-already-too-late-to-get-on-the-wave-of-functional-programming-and-Scala",
  :url_title
  "<strong>Quora</strong> 
- Is it already too late to get on the wave of functional programming and Scala"}
 {:c 14,
  :display_url "python.org/doc/3.0.1/about.html",
  :n_group 54,
  :short_host "python.org",
  :short_host_url "http://www.python.org/",
  :snippet
  "Contributors to the Python Documentation.  About these documents.  These documents are generated from reStructuredText sources by Sphinx, a document processor specifically written for the Python documentation.",
  :url "http://www.python.org/doc/3.0.1/about.html",
  :url_title
  "<strong>Python Programming Language</strong> - About these documents — Python v3.0.1 documentation"}]

Anyway, thought it might be useful (even though blekko’s gone and nuked free access to their API).


All It Took Was An AHA!

I can clearly recall why I became a computer scientist. I was sitting in a class and we were discussing how cons was implemented. And then I saw this definition:

1
2
3
4
5
6
7
8
9
(define (cons a b)
    (lambda (x)
        (if (= x 1) a b)))

(define (first l)
    (l 1))

(define (rest l)
    (l 2))

The lambda calculus and the material in the little schemer kept me in the field (computer-science i.e.) and assured me that there would never be a dearth of aha! moments in my education.

Good educators can deliver such aha! moments in every single lecture. A good textbook can do it several times each chapter.

I have since tried to find material that delivers such aha! moments.

Hopefully, I will encounter them for the rest of my life in whatever I do.


Robust Principal Component Pursuit - Background Matrix Recovery

I recently spent some time working on a simple linear algebra problem - decompose a matrix $ M $ into a low-rank component $ L $ and a sparse component $ S $. The algorithm I used was very trivial to implement (and parallelize using map-reduce).

In this post, I will implement this very simple algorithm, explain the objective function and demonstrate its (amazing) effectiveness on a surveillance-camera dataset.


Tree Edit Distance in Clojure

I had to throw together an implementation of tree-edit distance for clustering web-pages based on their structure. It performs reasonably quickly. The algorithm itself. The repo is here.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
(ns tree-edit-distance.core
  (:use [clojure.pprint])
  (:import (org.htmlcleaner HtmlCleaner DomSerializer CleanerProperties)
           (org.w3c.dom Document)))

(defn init
  "Perform the correct initialization"
  [m n c1 c2 del-cost ins-cost]
  (let [M (make-array Integer/TYPE (inc m) (inc n))]
    (do
      (doseq [i (range (inc m))
              j (range (inc n))]
        (aset M i j (int
                     (+ (* del-cost c1)
                        (* ins-cost c2)))))
      M)))

(defn num-children
  "Expects a html tree"
  [a-tree]
  (if(.hasChildNodes a-tree)
    (.getLength (.getChildNodes a-tree)) 0))

(defn tree-children
  "Return level 1 children"
  [a-tree]
  (let [n  (num-children a-tree)
        cs (.getChildNodes a-tree)]
    (map
     #(.item cs %)
     (range n))))

(defn tree-descendants
  [a-tree]
  (if (.hasChildNodes a-tree)
    (concat (tree-children a-tree)
            (flatten (map tree-descendants (tree-children a-tree))))
    []))

(declare rtdm-edit-distance)

(defn invert-cost
  [t1 t2 del-cost ins-cost sub-cost]
  (let [t1-desc (tree-descendants t1)
        t2-desc (tree-descendants t2)]
    (- (+ (* del-cost (count t1-desc))
          (* ins-cost (count t2-desc)))
       (rtdm-edit-distance t1 t2 del-cost ins-cost sub-cost))))

(defn rtdm-edit-distance
  "The RTDM algorithm for computing edit-distance.
   The trees are assumed to be org.w3c.dom.Documents"
  [tree-1 tree-2 del-cost ins-cost sub-cost]

  (let [m (num-children tree-1)
        n (num-children tree-2)

        t1-children (tree-children tree-1)
        t2-children (tree-children tree-2)

        t1-desc (tree-descendants tree-1)
        t2-desc (tree-descendants tree-2)

        M (init m n (count t1-desc) (count t2-desc) del-cost ins-cost)]

    (doseq [i (range m)
            j (range n)]
      (let [c-i (nth t1-children i)
            c-j (nth t2-children j)

            c-i-desc (tree-descendants c-i)
            c-j-desc (tree-descendants c-j)

            del (aget M i (inc j))
            ins (aget M (inc i) j)

            sub-i (- (aget M i j)
                     del-cost
                     ins-cost)

            sub (if (.isEqualNode c-i c-j)
                  (- sub-i
                     (* ins-cost (count c-j-desc))
                     (* del-cost (count c-i-desc)))
                  (cond
                   (or (not (.hasChildNodes c-i))
                       (not (.hasChildNodes c-j)))
                   (+ sub-i sub-cost)

                   (and (= (.getNodeName c-i) (.getNodeName c-j))
                        (try
                          (= (.getNodeValue (.getNamedItem (.getAttributes c-i) "id"))
                             (.getNodeValue (.getNamedItem (.getAttributes c-j) "id")))
                          (catch Exception e true))
                        (try
                          (= (.getNodeValue (.getNamedItem (.getAttributes c-i) "class"))
                             (.getNodeValue (.getNamedItem (.getAttributes c-j) "class")))
                          (catch Exception e true)))
                   (- sub-i (invert-cost c-i c-j del-cost ins-cost sub-cost))

                   :else
                   (+ sub-i sub-cost)))]
        (aset M (inc i) (inc j) (int (min del ins sub)))))
    (aget M m n)))

(defn rtdm-edit-distance-sim
  [tree-1 tree-2 del-cost ins-cost sub-cost]
  (let [t1-desc (tree-descendants tree-1)
        t2-desc (tree-descendants tree-2)]
    (- 1
       (/ (rtdm-edit-distance tree-1 tree-2 del-cost ins-cost sub-cost)
          (+ (* (+ (count t1-desc) 1) del-cost)
             (* (+ (count t2-desc) 1) sub-cost))))))

(defn get-xml-tree-body
  "Downloads a webpage and converts it to an org.w3.dom.Document"
  [page-src]

  (let [cleaner        (new HtmlCleaner)
        props          (.getProperties cleaner)
        cleaner-props  (new CleanerProperties)
        dom-serializer (new DomSerializer cleaner-props)
        tag-node       (.clean cleaner page-src)]

    (.createDOM dom-serializer tag-node)))

(defn rtdm-edit-distance-html
  [pg1 pg2 del-cost ins-cost sub-cost]
  (rtdm-edit-distance-sim
   (get-xml-tree-body pg1)
   (get-xml-tree-body pg2)
   del-cost
   ins-cost
   sub-cost))


Per Intellectum, Vis
(c) Shriphani Palakodety 2013-2016