SHRIPHANI PALAKODETY (page 10)

HOME
ABOUT
|
|
ART
|
TEACHING
|
RSS

Probabilistic Counting

2013-12-13

loglog, flajolet, probabilistic-counting, clojure

I recently got my hands on the common-crawl URL dataset from here and wanted to compute some stats like the number of unique domains contained in the corpus. This was the perfect opportunity to whip up a script that implemented Durande et. al’s log-log paper.

Tree Edit Distance Enlive Version

2013-12-09

tree-edit-distance, research, enlive

In my last post, I presented a code-dump that computed a restricted version of the tree edit distance algorithm. I was able to achieve a decent speed-up using enlive. Here’s a code-dump:

Tree Edit Distance in Clojure

2013-12-03

research, dom, trees, edit-distance, dynamic-programming, clojure

I had to throw together an implementation of tree-edit distance for clustering web-pages based on their structure. It performs reasonably quickly. The algorithm itself. The repo is here.

(ns tree-edit-distance.core
  (:use [clojure.pprint])
  (:import (org.htmlcleaner HtmlCleaner DomSerializer CleanerProperties)
           (org.w3c.dom Document)))

(defn init
  "Perform the correct initialization"
  [m n c1 c2 del-cost ins-cost]
  (let [M (make-array Integer/TYPE (inc m) (inc n))]
    (do
      (doseq [i (range (inc m))
              j (range (inc n))]
        (aset M i j (int
                     (+ (* del-cost c1)
                        (* ins-cost c2)))))
      M)))

(defn num-children
  "Expects a html tree"
  [a-tree]
  (if(.hasChildNodes a-tree)
    (.getLength (.getChildNodes a-tree)) 0))

(defn tree-children
  "Return level 1 children"
  [a-tree]
  (let [n  (num-children a-tree)
        cs (.getChildNodes a-tree)]
    (map
     #(.item cs %)
     (range n))))

(defn tree-descendants
  [a-tree]
  (if (.hasChildNodes a-tree)
    (concat (tree-children a-tree)
            (flatten (map tree-descendants (tree-children a-tree))))
    []))

(declare rtdm-edit-distance)

(defn invert-cost
  [t1 t2 del-cost ins-cost sub-cost]
  (let [t1-desc (tree-descendants t1)
        t2-desc (tree-descendants t2)]
    (- (+ (* del-cost (count t1-desc))
          (* ins-cost (count t2-desc)))
       (rtdm-edit-distance t1 t2 del-cost ins-cost sub-cost))))

(defn rtdm-edit-distance
  "The RTDM algorithm for computing edit-distance.
   The trees are assumed to be org.w3c.dom.Documents"
  [tree-1 tree-2 del-cost ins-cost sub-cost]

  (let [m (num-children tree-1)
        n (num-children tree-2)

        t1-children (tree-children tree-1)
        t2-children (tree-children tree-2)

        t1-desc (tree-descendants tree-1)
        t2-desc (tree-descendants tree-2)

        M (init m n (count t1-desc) (count t2-desc) del-cost ins-cost)]

    (doseq [i (range m)
            j (range n)]
      (let [c-i (nth t1-children i)
            c-j (nth t2-children j)

            c-i-desc (tree-descendants c-i)
            c-j-desc (tree-descendants c-j)

            del (aget M i (inc j))
            ins (aget M (inc i) j)

            sub-i (- (aget M i j)
                     del-cost
                     ins-cost)

            sub (if (.isEqualNode c-i c-j)
                  (- sub-i
                     (* ins-cost (count c-j-desc))
                     (* del-cost (count c-i-desc)))
                  (cond
                   (or (not (.hasChildNodes c-i))
                       (not (.hasChildNodes c-j)))
                   (+ sub-i sub-cost)

                   (and (= (.getNodeName c-i) (.getNodeName c-j))
                        (try
                          (= (.getNodeValue (.getNamedItem (.getAttributes c-i) "id"))
                             (.getNodeValue (.getNamedItem (.getAttributes c-j) "id")))
                          (catch Exception e true))
                        (try
                          (= (.getNodeValue (.getNamedItem (.getAttributes c-i) "class"))
                             (.getNodeValue (.getNamedItem (.getAttributes c-j) "class")))
                          (catch Exception e true)))
                   (- sub-i (invert-cost c-i c-j del-cost ins-cost sub-cost))

                   :else
                   (+ sub-i sub-cost)))]
        (aset M (inc i) (inc j) (int (min del ins sub)))))
    (aget M m n)))

(defn rtdm-edit-distance-sim
  [tree-1 tree-2 del-cost ins-cost sub-cost]
  (let [t1-desc (tree-descendants tree-1)
        t2-desc (tree-descendants tree-2)]
    (- 1
       (/ (rtdm-edit-distance tree-1 tree-2 del-cost ins-cost sub-cost)
          (+ (* (+ (count t1-desc) 1) del-cost)
             (* (+ (count t2-desc) 1) sub-cost))))))

(defn get-xml-tree-body
  "Downloads a webpage and converts it to an org.w3.dom.Document"
  [page-src]

  (let [cleaner        (new HtmlCleaner)
        props          (.getProperties cleaner)
        cleaner-props  (new CleanerProperties)
        dom-serializer (new DomSerializer cleaner-props)
        tag-node       (.clean cleaner page-src)]

    (.createDOM dom-serializer tag-node)))

(defn rtdm-edit-distance-html
  [pg1 pg2 del-cost ins-cost sub-cost]
  (rtdm-edit-distance-sim
   (get-xml-tree-body pg1)
   (get-xml-tree-body pg2)
   del-cost
   ins-cost
   sub-cost))

A Comment on Dimension-Estimation

2013-11-29

dimension-reduction, dimension, grassberger-procaccia, k-nn-classifier

I saw this neat comment in a paper I was recently reading. If you have all i.i.d features and you want to estimate its dimension using Grassberger-Procaccia (which estimates dimension using a distance-based metric) or want to classify using a k-NN classifier, it is bad if the data points are mostly pairwise equidistant (for instance, a correlation integral plot will look like a step function and thus will be useless; a k-NN classifier will break because the test point ends up equidistant from all the existing points).

There is a trivial argument using the Hoeffding bound in Chris Burges’ paper that suggests that if the features are all i.i.d, a majority of pairwise distances will end up clustered tightly around a mean which means that k-NN or Grassberger-Procaccia won’t work well. I am going to repeat this argument here so I can remember it for later:

Our vectors are of dimension $ d $ and the components are $ \pm1 $. Assuming all the components are $ iid $, the Hoeffding bound gives us:

$$ P(||| x_{1} - x_{2} ||^{2} – 2d| > d\epsilon) = P(| x_{1} \cdot x_{2} | > d\epsilon/2) \le 2exp(-\frac{d\epsilon^2}{8})$$

and this shows us that most pairwise distances will end up clustered very tightly around a mean and this means that a majority of pairs of points in the dataset will end up equidistant and thus a $ k-NN $ classifier will fail.

This also means that the correlation integral is a good way to determine if a k-NN classifier will work well. If the plot resembles a spike, the distance function needs to change.

The correlation-integral is an immensely powerful tool and here’s an implementation

The Story of Phineas Gage

2013-11-27

phrenology, phineas-gage, psychology

I first encountered the story of Phineas Gage in my freshman year in college. Phineas had a projectile drill a hole in his head (while he worked the railroad) and he survived the incident. The incident caused a massive personality change in Phineas and conclusively ruled out phrenology as a reasonable model of how the brain worked.

After the incident Phineas’ life comprised a trip to South America, a gig at P.T. Barnum’s establishment and a massive change in our understanding of the brain. This wikipedia article covers a good chunk of this work.

Obituary

2013-11-22

obituary

Earlier this month, my paternal grandfather passed away. I didn’t speak to him before his death since I expected him to recover from his current bout of ill-health and I have since carried some guilt thanks to this.

I think this excerpt from a Pink Floyd song best summarizes his life:

Run, run rabbit run
Dig that hole, forget the sun
And when at last the work is done
Don’t sit down, it’s time to dig another one

-- [Breathe, Pink Floyd](http://rock.rapgenius.com/Pink-floyd-breathe-lyrics)

RIP.

Web Crawling - Dos and Don’ts

2013-10-15

research, itsy, crawling

For my SIGIR submission I have been working on finding efficient traversal strategies while crawling websites.

Web crawling is a straightforward graph-traversal problem. My research focuses on discarding unproductive paths and preserving bandwidth to find more information. I will write a post on it once I have my ideas fleshed out and thus that won’t be the focus of this post.

Here, I will describe the finer details needed to make your crawler polite and robust. An impolite crawler will incur the wrath of an admin and might get you banned. A crawler that isn’t robust cannot survive the onslaught of quirks that the WWW is full of.

Zephyros Racket API

2013-10-13

racket, macros, "window managers", zephyros

In the recent past, I wanted to control the OS X window manager from racket like I could on Linux using the X11 library. I found a very sweet Github project called zephyros that implemented a large number of vital routines (vital for managing windows anyway) and provided a simple protocol using json. Since it would be convenient to have a racket module, I wrote a wrapper around it.

Racket Whistlepig Bindings

2013-08-27

racket, ffi, racket-ffi, search, real-time, retrieval

Whistlepig is a lightweight real-time search engine written in ANSI C. (description and source) I heard about it when Don Metzler plugged it in an answer he wrote on quora. In this post, with very little code, I was able to build an index, query it and write a servlet that talks to the index using the FFI.

Clueweb12++ Status Report

2013-08-16

research, data processing, crawl, phd

I recently gave a talk at CMU on the state of the Clueweb12++ crawl. Here are the slides.

Twitter: @shriphani
Instagram: @life_of_ess
Fortior Per Mentem
(c) Shriphani Palakodety 2013-2020