All It Took Was An AHA!

I can clearly recall why I became a computer scientist. I was sitting in a class and we were discussing how cons was implemented. And then I saw this definition:

1
2
3
4
5
6
7
8
9
(define (cons a b)
    (lambda (x)
        (if (= x 1) a b)))

(define (first l)
    (l 1))

(define (rest l)
    (l 2))

The lambda calculus and the material in the little schemer kept me in the field (computer-science i.e.) and assured me that there would never be a dearth of aha! moments in my education.

Good educators can deliver such aha! moments in every single lecture. A good textbook can do it several times each chapter.

I have since tried to find material that delivers such aha! moments.

Hopefully, I will encounter them for the rest of my life in whatever I do.


Robust Principal Component Pursuit - Background Matrix Recovery

I recently spent some time working on a simple linear algebra problem - decompose a matrix $ M $ into a low-rank component $ L $ and a sparse component $ S $. The algorithm I used was very trivial to implement (and parallelize using map-reduce).

In this post, I will implement this very simple algorithm, explain the objective function and demonstrate its (amazing) effectiveness on a surveillance-camera dataset.


Tree Edit Distance in Clojure

I had to throw together an implementation of tree-edit distance for clustering web-pages based on their structure. It performs reasonably quickly. The algorithm itself. The repo is here.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
(ns tree-edit-distance.core
  (:use [clojure.pprint])
  (:import (org.htmlcleaner HtmlCleaner DomSerializer CleanerProperties)
           (org.w3c.dom Document)))

(defn init
  "Perform the correct initialization"
  [m n c1 c2 del-cost ins-cost]
  (let [M (make-array Integer/TYPE (inc m) (inc n))]
    (do
      (doseq [i (range (inc m))
              j (range (inc n))]
        (aset M i j (int
                     (+ (* del-cost c1)
                        (* ins-cost c2)))))
      M)))

(defn num-children
  "Expects a html tree"
  [a-tree]
  (if(.hasChildNodes a-tree)
    (.getLength (.getChildNodes a-tree)) 0))

(defn tree-children
  "Return level 1 children"
  [a-tree]
  (let [n  (num-children a-tree)
        cs (.getChildNodes a-tree)]
    (map
     #(.item cs %)
     (range n))))

(defn tree-descendants
  [a-tree]
  (if (.hasChildNodes a-tree)
    (concat (tree-children a-tree)
            (flatten (map tree-descendants (tree-children a-tree))))
    []))

(declare rtdm-edit-distance)

(defn invert-cost
  [t1 t2 del-cost ins-cost sub-cost]
  (let [t1-desc (tree-descendants t1)
        t2-desc (tree-descendants t2)]
    (- (+ (* del-cost (count t1-desc))
          (* ins-cost (count t2-desc)))
       (rtdm-edit-distance t1 t2 del-cost ins-cost sub-cost))))

(defn rtdm-edit-distance
  "The RTDM algorithm for computing edit-distance.
   The trees are assumed to be org.w3c.dom.Documents"
  [tree-1 tree-2 del-cost ins-cost sub-cost]

  (let [m (num-children tree-1)
        n (num-children tree-2)

        t1-children (tree-children tree-1)
        t2-children (tree-children tree-2)

        t1-desc (tree-descendants tree-1)
        t2-desc (tree-descendants tree-2)

        M (init m n (count t1-desc) (count t2-desc) del-cost ins-cost)]

    (doseq [i (range m)
            j (range n)]
      (let [c-i (nth t1-children i)
            c-j (nth t2-children j)

            c-i-desc (tree-descendants c-i)
            c-j-desc (tree-descendants c-j)

            del (aget M i (inc j))
            ins (aget M (inc i) j)

            sub-i (- (aget M i j)
                     del-cost
                     ins-cost)

            sub (if (.isEqualNode c-i c-j)
                  (- sub-i
                     (* ins-cost (count c-j-desc))
                     (* del-cost (count c-i-desc)))
                  (cond
                   (or (not (.hasChildNodes c-i))
                       (not (.hasChildNodes c-j)))
                   (+ sub-i sub-cost)

                   (and (= (.getNodeName c-i) (.getNodeName c-j))
                        (try
                          (= (.getNodeValue (.getNamedItem (.getAttributes c-i) "id"))
                             (.getNodeValue (.getNamedItem (.getAttributes c-j) "id")))
                          (catch Exception e true))
                        (try
                          (= (.getNodeValue (.getNamedItem (.getAttributes c-i) "class"))
                             (.getNodeValue (.getNamedItem (.getAttributes c-j) "class")))
                          (catch Exception e true)))
                   (- sub-i (invert-cost c-i c-j del-cost ins-cost sub-cost))

                   :else
                   (+ sub-i sub-cost)))]
        (aset M (inc i) (inc j) (int (min del ins sub)))))
    (aget M m n)))

(defn rtdm-edit-distance-sim
  [tree-1 tree-2 del-cost ins-cost sub-cost]
  (let [t1-desc (tree-descendants tree-1)
        t2-desc (tree-descendants tree-2)]
    (- 1
       (/ (rtdm-edit-distance tree-1 tree-2 del-cost ins-cost sub-cost)
          (+ (* (+ (count t1-desc) 1) del-cost)
             (* (+ (count t2-desc) 1) sub-cost))))))

(defn get-xml-tree-body
  "Downloads a webpage and converts it to an org.w3.dom.Document"
  [page-src]

  (let [cleaner        (new HtmlCleaner)
        props          (.getProperties cleaner)
        cleaner-props  (new CleanerProperties)
        dom-serializer (new DomSerializer cleaner-props)
        tag-node       (.clean cleaner page-src)]

    (.createDOM dom-serializer tag-node)))

(defn rtdm-edit-distance-html
  [pg1 pg2 del-cost ins-cost sub-cost]
  (rtdm-edit-distance-sim
   (get-xml-tree-body pg1)
   (get-xml-tree-body pg2)
   del-cost
   ins-cost
   sub-cost))

A Comment on Dimension-Estimation

I saw this neat comment in a paper I was recently reading. If you have all i.i.d features and you want to estimate its dimension using Grassberger-Procaccia (which estimates dimension using a distance-based metric) or want to classify using a k-NN classifier, it is bad if the data points are mostly pairwise equidistant (for instance, a correlation integral plot will look like a step function and thus will be useless; a k-NN classifier will break because the test point ends up equidistant from all the existing points).

There is a trivial argument using the Hoeffding bound in Chris Burges’ paper that suggests that if the features are all i.i.d, a majority of pairwise distances will end up clustered tightly around a mean which means that k-NN or Grassberger-Procaccia won’t work well. I am going to repeat this argument here so I can remember it for later:

Our vectors are of dimension $ d $ and the components are $ \pm1 $. Assuming all the components are $ iid $, the Hoeffding bound gives us:

$$ P(||| x_{1} - x_{2} ||^{2} – 2d| > d\epsilon) = P(| x_{1} \cdot x_{2} | > d\epsilon/2) \le 2exp(-\frac{d\epsilon^2}{8})$$

and this shows us that most pairwise distances will end up clustered very tightly around a mean and this means that a majority of pairs of points in the dataset will end up equidistant and thus a $ k-NN $ classifier will fail.

This also means that the correlation integral is a good way to determine if a k-NN classifier will work well. If the plot resembles a spike, the distance function needs to change.

The correlation-integral is an immensely powerful tool and here’s an implementation


The Story of Phineas Gage

I first encountered the story of Phineas Gage in my freshman year in college. Phineas had a projectile drill a hole in his head (while he worked the railroad) and he survived the incident. The incident caused a massive personality change in Phineas and conclusively ruled out phrenology as a reasonable model of how the brain worked.

After the incident Phineas’ life comprised a trip to South America, a gig at P.T. Barnum’s establishment and a massive change in our understanding of the brain. This wikipedia article covers a good chunk of this work.


Obituary

Earlier this month, my paternal grandfather passed away. I didn’t speak to him before his death since I expected him to recover from his current bout of ill-health and I have since carried some guilt thanks to this.

I think this excerpt from a Pink Floyd song best summarizes his life:

Run, run rabbit run
Dig that hole, forget the sun
And when at last the work is done
Don’t sit down, it’s time to dig another one

-- [Breathe, Pink Floyd](http://rock.rapgenius.com/Pink-floyd-breathe-lyrics)

RIP.



Per Intellectum, Vis
(c) Shriphani Palakodety 2013-2016