Augmenting enlive

When manipulating HTML documents to extract features, I find myself needing the same operations all the time: removing script tags, comments and the like. This feature set is available in HtmlCleaner, so I combined the two libraries to produce enlive-helper.

Now you can do:

 ( "<html><body><a>hi</a></body></html>") 
 :prune-tags "a")

And as a result the a tag is not picked up:

({:tag :html,
  :attrs nil,
  :content
  ({:tag :head, :attrs nil, :content nil}
   {:tag :body, :attrs nil, :content nil})})

The options you can pass mirror those of HtmlCleaner. Full docs are available in this GitHub repo.

Also, the code is something I threw together from my research so it is released under Matt Might’s CRAPL license.

Diagnosis by Google Doesn’t Work

I have often Googled my symptoms, visited WebMD, and concluded that I have a deadly disease. At SIGIR 2013, Ryen White’s paper, Beliefs and Biases in Web Search, provided empirical evidence for the poor success rate of diagnosis-by-Google.

The authors mined medical yes/no questions (for example: can salmonella cause belly-ache?), had physicians answer them, and then measured user bias post-search: after perusing the results, users answered their original questions with yes or no. The paper contains a very detailed description of the experiments conducted.

The accuracy of the final answer was the most interesting part of this paper - only about half of the questions were accurately answered. That is as good as flipping a (fair) coin for each question. The rest of the paper was a fairly interesting read (and it won the SIGIR 2013 best paper award).

Consistent Hashing in Clojure

I wrote this post to teach myself consistent hashing - a simple hash family that Akamai’s founders came up with. This was originally done to prepare for a talk in my grad algorithms class (I made a horlicks of the talk but whatever). I am going to provide intuition, analysis and a Clojure implementation.
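As a taste of the idea, here is a minimal sketch of a hash ring. It is not the implementation from the full post: the names make-ring, lookup and remove-node are made up for this illustration, Clojure's built-in hash stands in for a proper hash family, and each node gets a single point on the ring (no virtual nodes).

```clojure
(defn make-ring
  "Places each node on the ring at the position given by its hash.
  Assumption: the built-in `hash` is good enough for a demo."
  [nodes]
  (into (sorted-map) (map (juxt hash identity) nodes)))

(defn lookup
  "Returns the first node at or after the key's hash,
  wrapping around to the smallest position when none follows."
  [ring k]
  (if-let [[_ node] (first (subseq ring >= (hash k)))]
    node
    (val (first ring))))

(defn remove-node
  "Drops a node from the ring; only keys it owned need a new home."
  [ring node]
  (dissoc ring (hash node)))

;; Hypothetical node names, purely for illustration.
(def ring (make-ring ["cache-a" "cache-b" "cache-c"]))

(lookup ring "user-42")
(lookup (remove-node ring "cache-b") "user-42")
```

The point of the sorted map is that removing a node leaves every other node's position untouched, so a key changes owner only if its old owner was the node that went away.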

Fertility Rates and Prosperity

Singapore’s government (with Mentos involved in this awkward project) used this ad during its National Day celebrations to encourage people to copulate and increase Singapore’s birth rate. I hypothesized that a low fertility rate (number of children per woman) was not unique to Singapore (though the problem might be more acute there).

Here’s a plot of GDP vs. fertility rates for all nations. The red dots are the developed countries:

Clearly the poorest nations have a ridiculously high fertility rate. Almost all of the developed economies (the red dots) lie below the replacement fertility rate of 2.1.

Here’s a plot of just the developed economies:

And a similar plot for developed Asian economies:

Full source code and datasets are available in this GitHub repo.

Clojure/Java String trim

Java’s String.trim only strips characters at or below U+0020 (the space and control characters). Clojure’s clojure.string/trim tests for whitespace using Character.isWhitespace, which does not count non-breaking spaces such as U+00A0.

While processing a dataset off the web containing Unicode space characters, the trim routine failed to do anything useful. Luckily, a StackOverflow thread suggested using a routine from Google’s Guava library. So in Clojure, you can do this:

(import 'com.google.common.base.CharMatcher)
#(.trimFrom CharMatcher/WHITESPACE %)

and this will do the job.
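To see the problem without pulling in Guava, here is a minimal sketch that uses only the JDK. The helper names space-char? and unicode-trim are made up for this illustration, and treating a character as trimmable when either Character/isWhitespace or Character/isSpaceChar accepts it is my assumption about what counts as a space; it is not the same definition Guava's CharMatcher uses.

```clojure
(require '[clojure.string :as str])

(defn space-char?
  "True for anything the JDK calls whitespace or a Unicode space separator."
  [c]
  (or (Character/isWhitespace (char c))
      (Character/isSpaceChar (char c))))

(defn unicode-trim
  "Hypothetical helper: strips leading and trailing space characters."
  [^String s]
  (let [n     (count s)
        ;; index of the first character that is not a space
        start (loop [i 0]
                (if (and (< i n) (space-char? (.charAt s i)))
                  (recur (inc i))
                  i))
        ;; index just past the last character that is not a space
        end   (loop [j n]
                (if (and (> j start) (space-char? (.charAt s (dec j))))
                  (recur (dec j))
                  j))]
    (subs s start end)))

(def padded "\u00A0hi\u00A0")   ; non-breaking spaces on both sides

(count (str/trim padded))      ; 4, the non-breaking spaces survive
(unicode-trim padded)          ; "hi"
```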

The TW-IDF Model

I found this recent paper on ad hoc IR from CIKM 2013. I haven’t worked in this area but the results seem promising. Essentially, instead of a term frequency (TF), you store the term's indegree count. The graph in question is the term co-occurrence graph within a sliding window, with the direction of each edge indicating word order. The paper won an honorable mention at CIKM so it is clearly very cool.
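The indegree count can be sketched roughly as follows. This is my reading of the description above and not the paper's code: the function names are made up, the window size is a free parameter, tokens are split on whitespace, and edges run from a term to the distinct terms that follow it inside the window. The IDF and normalization parts of the model are left out.

```clojure
(require '[clojure.string :as str])

(defn cooccurrence-edges
  "Set of directed edges [earlier later] between distinct terms that
  fall inside the same sliding window of `window` tokens."
  [tokens window]
  (set
   (for [i (range (count tokens))
         j (range (inc i) (min (count tokens) (+ i window)))
         :let [from (nth tokens i)
               to   (nth tokens j)]
         :when (not= from to)]
     [from to])))

(defn indegree-weights
  "Maps each term to the number of distinct terms pointing at it.
  Terms that nothing points at get a weight of 0."
  [tokens window]
  (let [edges (cooccurrence-edges tokens window)]
    (merge (zipmap tokens (repeat 0))
           (frequencies (map second edges)))))

;; A toy document; the sentence and the window of 3 are arbitrary choices.
(def doc-tokens
  (str/split "the quick fox jumps over the lazy dog" #"\s+"))

(indegree-weights doc-tokens 3)
```

Because edges are kept in a set, a term that repeatedly follows the same neighbour still adds only one to the indegree, which is the main difference from a raw TF count.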

Link to paper

Fortior Per Mentem
(c) Shriphani Palakodety 2013-2018