Earlier this week I gave a talk at the Pittsburgh Clojure meetup on scraping with Clojure, primarily using clj-xpath and Enlive. Slides and code are linked below.
Code: https://github.com/shriphani/clojure_scraping_overview
I can clearly recall why I became a computer scientist. I was sitting in a class and we were discussing how cons
was implemented. And then I saw this definition:
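The definition itself does not survive in this excerpt, but it was presumably the classic closure-based encoding of `cons` made famous by SICP and The Little Schemer: a pair needs no primitive data structure at all. A sketch in Clojure (the names `my-cons`, `my-car`, `my-cdr` are mine, to avoid shadowing the built-ins):

```clojure
;; A pair is just a closure that remembers its two arguments.
(defn my-cons [a b]
  (fn [selector] (selector a b)))

;; Accessors pass in a selector function that picks one slot.
(defn my-car [pair]
  (pair (fn [a b] a)))

(defn my-cdr [pair]
  (pair (fn [a b] b)))
```

Calling `(my-car (my-cons 1 2))` hands the closure a selector that returns its first argument, so the pair "contains" its elements purely through lexical scope.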
The lambda calculus and the material in The Little Schemer kept me in the field (computer science, that is) and assured me that there would never be a dearth of aha! moments in my education.
Good educators can deliver such aha! moments in every single lecture. A good textbook can do it several times each chapter.
I have since tried to find material that delivers such aha! moments.
Hopefully, I will encounter them for the rest of my life in whatever I do.
The Vatican has a fantastic interactive view of Michelangelo's work in the Sistine Chapel.
I recently spent some time working on a simple linear algebra problem: decompose a matrix $ M $ into a low-rank component $ L $ and a sparse component $ S $. The algorithm I used was trivial to implement (and to parallelize using MapReduce).
In this post, I will implement this very simple algorithm, explain the objective function, and demonstrate its (amazing) effectiveness on a surveillance-camera dataset.
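The post's own implementation is not reproduced in this excerpt, but the workhorse primitive in sparse-plus-low-rank decompositions of this kind (e.g. Principal Component Pursuit) is elementwise soft-thresholding, which is what drives entries of $ S $ to exact zeros. A minimal sketch (function names are mine):

```clojure
(defn soft-threshold
  "Shrink x toward zero by tau; entries with |x| <= tau become exactly 0.
   Applied elementwise, this is the proximal operator of the l1 norm."
  [tau x]
  (* (Math/signum (double x))
     (max 0.0 (- (Math/abs (double x)) tau))))

;; Shrinking a whole matrix, represented as a vector of row vectors:
(defn shrink-matrix [tau m]
  (mapv #(mapv (partial soft-threshold tau) %) m))
```

Because the operator is applied independently to each entry, it is embarrassingly parallel, which is what makes the MapReduce formulation straightforward.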
I recently got my hands on the Common Crawl URL dataset from here and wanted to compute some stats, like the number of unique domains contained in the corpus. This was the perfect opportunity to whip up a script implementing Durand and Flajolet's LogLog paper.
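For reference, a back-of-the-envelope LogLog counter (this is my own sketch, not the post's script; the bucket count `m = 2^k` and the use of Clojure's built-in `hash` are assumptions for illustration):

```clojure
(def k 10)
(def m (bit-shift-left 1 k))                        ; 2^k buckets

(defn loglog-estimate
  "LogLog cardinality estimate (Durand & Flajolet): the top k bits of
   each item's hash pick a bucket; each bucket keeps the maximum rank
   (position of the first 1-bit) seen in the remaining 32 - k bits;
   the estimate exponentiates the average of those maxima."
  [items]
  (let [maxes (long-array m)]
    (doseq [item items]
      (let [h      (bit-and (hash item) 0xFFFFFFFF) ; unsigned 32-bit hash
            bucket (bit-shift-right h (- 32 k))
            low    (bit-and h (dec (bit-shift-left 1 (- 32 k))))
            r      (if (zero? low)
                     (inc (- 32 k))
                     (inc (- (Long/numberOfLeadingZeros low) 32 k)))]
        (when (> r (aget maxes bucket))
          (aset maxes bucket r))))
    ;; 0.39701 is the asymptotic bias-correction constant from the paper
    (* 0.39701 m
       (Math/pow 2.0 (/ (areduce maxes i acc 0 (+ acc (aget maxes i)))
                        (double m))))))
```

Each bucket stores only a small integer (a rank), so the whole sketch fits in a few kilobytes regardless of how many URLs stream through it; the standard error shrinks as roughly $ 1.3/\sqrt{m} $.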
In my last post, I presented a code dump that computed a restricted version of the tree edit distance algorithm. I was able to achieve a decent speedup using Enlive. Here's a code dump:
I had to throw together an implementation of tree edit distance for clustering web pages based on their structure. It performs reasonably quickly. The algorithm itself is RTDM (restricted top-down mapping). The repo is here.
```clojure
(ns tree-edit-distance.core
  (:use [clojure.pprint])
  (:import (org.htmlcleaner HtmlCleaner DomSerializer CleanerProperties)
           (org.w3c.dom Document)))

(defn init
  "Perform the correct initialization: every cell of the (m+1) x (n+1)
   matrix starts at the cost of deleting all of tree 1 and inserting
   all of tree 2."
  [m n c1 c2 del-cost ins-cost]
  (let [M (make-array Integer/TYPE (inc m) (inc n))]
    (doseq [i (range (inc m))
            j (range (inc n))]
      (aset M i j (int
                   (+ (* del-cost c1)
                      (* ins-cost c2)))))
    M))

(defn num-children
  "Expects an HTML tree"
  [a-tree]
  (if (.hasChildNodes a-tree)
    (.getLength (.getChildNodes a-tree))
    0))

(defn tree-children
  "Return level-1 children"
  [a-tree]
  (let [n  (num-children a-tree)
        cs (.getChildNodes a-tree)]
    (map #(.item cs %) (range n))))

(defn tree-descendants
  [a-tree]
  (if (.hasChildNodes a-tree)
    (concat (tree-children a-tree)
            (flatten (map tree-descendants (tree-children a-tree))))
    []))

(declare rtdm-edit-distance)

(defn invert-cost
  [t1 t2 del-cost ins-cost sub-cost]
  (let [t1-desc (tree-descendants t1)
        t2-desc (tree-descendants t2)]
    (- (+ (* del-cost (count t1-desc))
          (* ins-cost (count t2-desc)))
       (rtdm-edit-distance t1 t2 del-cost ins-cost sub-cost))))

(defn rtdm-edit-distance
  "The RTDM algorithm for computing edit distance.
   The trees are assumed to be org.w3c.dom.Documents"
  [tree1 tree2 del-cost ins-cost sub-cost]
  (let [m           (num-children tree1)
        n           (num-children tree2)
        t1-children (tree-children tree1)
        t2-children (tree-children tree2)
        t1-desc     (tree-descendants tree1)
        t2-desc     (tree-descendants tree2)
        M           (init m n (count t1-desc) (count t2-desc) del-cost ins-cost)]
    (doseq [i (range m)
            j (range n)]
      (let [ci      (nth t1-children i)
            cj      (nth t2-children j)
            ci-desc (tree-descendants ci)
            cj-desc (tree-descendants cj)
            del     (aget M i (inc j))
            ins     (aget M (inc i) j)
            sub-i   (- (aget M i j)
                       del-cost
                       ins-cost)
            sub     (if (.isEqualNode ci cj)
                      ;; identical subtrees: refund the cost of their descendants
                      (- sub-i
                         (* ins-cost (count cj-desc))
                         (* del-cost (count ci-desc)))
                      (cond
                        (or (not (.hasChildNodes ci))
                            (not (.hasChildNodes cj)))
                        (+ sub-i sub-cost)

                        ;; same tag with matching id and class attributes:
                        ;; recurse into the subtrees
                        (and (= (.getNodeName ci) (.getNodeName cj))
                             (try
                               (= (.getNodeValue (.getNamedItem (.getAttributes ci) "id"))
                                  (.getNodeValue (.getNamedItem (.getAttributes cj) "id")))
                               (catch Exception e true))
                             (try
                               (= (.getNodeValue (.getNamedItem (.getAttributes ci) "class"))
                                  (.getNodeValue (.getNamedItem (.getAttributes cj) "class")))
                               (catch Exception e true)))
                        (- sub-i (invert-cost ci cj del-cost ins-cost sub-cost))

                        :else
                        (+ sub-i sub-cost)))]
        (aset M (inc i) (inc j) (int (min del ins sub)))))
    (aget M m n)))

(defn rtdm-edit-distance-sim
  [tree1 tree2 del-cost ins-cost sub-cost]
  (let [t1-desc (tree-descendants tree1)
        t2-desc (tree-descendants tree2)]
    (- 1
       (/ (rtdm-edit-distance tree1 tree2 del-cost ins-cost sub-cost)
          (+ (* (+ (count t1-desc) 1) del-cost)
             (* (+ (count t2-desc) 1) sub-cost))))))

(defn get-xml-tree-body
  "Converts a page's source to an org.w3c.dom.Document"
  [page-src]
  (let [cleaner        (new HtmlCleaner)
        props          (.getProperties cleaner)
        cleaner-props  (new CleanerProperties)
        dom-serializer (new DomSerializer cleaner-props)
        tag-node       (.clean cleaner page-src)]
    (.createDOM dom-serializer tag-node)))

(defn rtdm-edit-distance-html
  [pg1 pg2 del-cost ins-cost sub-cost]
  (rtdm-edit-distance-sim
   (get-xml-tree-body pg1)
   (get-xml-tree-body pg2)
   del-cost
   ins-cost
   sub-cost))
```
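A quick usage sketch, assuming the namespace above is loaded (with its hyphenated names, e.g. `rtdm-edit-distance-html`) and HtmlCleaner is on the classpath; the HTML strings are made up for illustration, and all three edit costs are set to 1:

```clojure
(def pg1 "<html><body><div><p>a</p><p>b</p></div></body></html>")
(def pg3 "<html><body><table><tr><td>x</td></tr></table></body></html>")

;; A page compared against itself should score at least as similar
;; as a page with a completely different layout.
(rtdm-edit-distance-html pg1 pg1 1 1 1)
(rtdm-edit-distance-html pg1 pg3 1 1 1)
```

For clustering, one would compute these pairwise similarities over a crawl and feed them to any similarity-based clustering algorithm.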

I saw this neat comment in a paper I was recently reading. If you have all-i.i.d. features and you want to estimate their dimension using Grassberger-Procaccia (which estimates dimension using a distance-based metric), or you want to classify using a k-NN classifier, it is bad if the data points are mostly pairwise equidistant: a correlation integral plot will look like a step function and thus be useless, and a k-NN classifier will break because the test point ends up equidistant from all the existing points.
There is a simple argument using the Hoeffding bound in Chris Burges' paper suggesting that if the features are all i.i.d., a majority of pairwise distances end up clustered tightly around a mean, which means that k-NN and Grassberger-Procaccia won't work well. I am going to repeat this argument here so I can remember it for later:
Our vectors are of dimension $ d $ and the components are $ \pm 1 $. Each vector then has squared norm $ d $, so $ \lVert x_{1} - x_{2} \rVert^{2} = 2d - 2\, x_{1} \cdot x_{2} $. Assuming all the components are i.i.d., the Hoeffding bound gives us:
$$ P\left(\left|\lVert x_{1} - x_{2} \rVert^{2} - 2d\right| > d\epsilon\right) = P\left(\left|x_{1} \cdot x_{2}\right| > d\epsilon/2\right) \le 2\exp\left(-\frac{d\epsilon^{2}}{8}\right) $$
This shows that most pairwise distances cluster very tightly around a mean: a majority of pairs of points in the dataset end up (nearly) equidistant, and thus a $ k $-NN classifier will fail.
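The concentration is easy to see empirically. A small sketch (the dimension $ d = 1000 $ and the 50 sample vectors are arbitrary choices for illustration):

```clojure
(defn rand-pm1-vec
  "A random vector with i.i.d. +/-1 components."
  [d]
  (vec (repeatedly d #(if (< (rand) 0.5) -1 1))))

(defn dot [x y] (reduce + (map * x y)))

(def d 1000)
(def vecs (vec (repeatedly 50 #(rand-pm1-vec d))))
(def dots (for [i (range (count vecs))
                j (range (inc i) (count vecs))]
            (dot (vecs i) (vecs j))))

;; ||x1 - x2||^2 = 2d - 2 (x1 . x2), so the relative deviation of the
;; squared distance from its mean 2d is |x1 . x2| / d, which the bound
;; says is O(1/sqrt d).
(def max-rel-dev
  (/ (apply max (map #(Math/abs (long %)) dots)) (double d)))
```

Even the worst pair among all $ \binom{50}{2} $ deviates from the mean squared distance by only a small fraction, exactly the regime where distance-based methods lose their discriminative power.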
This also means that the correlation integral is a good way to determine whether a k-NN classifier will work well. If the plot resembles a spike, the distance function needs to change.
The correlation integral is an immensely powerful tool, and here's an implementation.
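The original implementation is not reproduced in this excerpt; a minimal sketch of the Grassberger-Procaccia correlation integral (my own, with a plain Euclidean metric assumed):

```clojure
(defn euclidean [x y]
  (Math/sqrt (reduce + (map #(let [d (- %1 %2)] (* d d)) x y))))

(defn correlation-integral
  "C(eps) = fraction of distinct point pairs whose distance is below eps.
   The slope of log C(eps) against log eps estimates the correlation
   dimension of the point set."
  [points eps]
  (let [pts   (vec points)
        n     (count pts)
        pairs (for [i (range n)
                    j (range (inc i) n)]
                (euclidean (pts i) (pts j)))]
    (/ (count (filter #(< % eps) pairs))
       (double (count pairs)))))
```

Evaluating it over a logarithmic grid of `eps` values gives the plot discussed above: a clean power-law slope for well-spread data, a step-function spike for concentrated pairwise distances.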
I first encountered the story of Phineas Gage in my freshman year of college. Phineas had a projectile drill a hole through his head (while he worked on the railroad) and survived the incident. The incident caused a massive personality change in Phineas and conclusively ruled out phrenology as a reasonable model of how the brain worked.
After the incident, Phineas' life comprised a trip to South America, a gig at P.T. Barnum's establishment, and a massive change in our understanding of the brain. This Wikipedia article covers a good chunk of this work.
Earlier this month, my paternal grandfather passed away. I didn't speak to him before his death, since I expected him to recover from his bout of ill health, and I have carried some guilt over this since.
I think this excerpt from a Pink Floyd song best summarizes his life:
Run, run rabbit run
Dig that hole, forget the sun
And when at last the work is done
Don’t sit down, it’s time to dig another one
 [Breathe, Pink Floyd](http://rock.rapgenius.com/Pinkfloydbreathelyrics)
RIP.