## Subotai: Data Mining for HTML Documents

I spent the last few months studying and implementing some routines that take a raw HTML document (or documents) and do stuff with it (them). Subotai is a library that consolidates some of these routines. In this blog post I will describe what is currently implemented and what the roadmap is.

## Disco Dora Maar

Disco Dora Maar, made with Quil and Clojure.

## Disco Rectangles

I was playing with Quil recently (I have a project planned, which I will write about later) and managed to throw this together in a short while:

Clojure source available here.

## Gianni Agnelli

Of Fiat, Ferrari, Maserati, Sestriere fame. Also, the best dressed man ever (IMO):

## Visualizing the most powerful brands by industry

Hover on the arcs for details.

Data from Forbes, plotted using d3.js.

## Wikipedia Server Requests By the Hour

I found this dataset of server requests to Wikipedia. This is a plot of the requests made per hour on the 19th of September, 2007.

Code and processed dataset used to generate this plot are in this repo.

## Augmenting enlive

In manipulating HTML documents for features, I find myself needing to use some operations all the time - removing script tags, comments and the like. This feature-set is available in HtmlCleaner and I thus merged the two libraries to produce enlive-helper.

Now you can do:

    (html-resource-steroids
      (java.io.StringReader. "hi")
      :prune-tags "a")

As a result, the `a` tag is not picked up:

    ({:tag :html,
      :attrs nil,
      :content ("\n"
                {:tag :head, :attrs nil, :content nil}
                "\n"
                {:tag :body, :attrs nil, :content nil})})

The options you can pass mirror those of HtmlCleaner. Full docs are available in this GitHub repo.

Also, the code is something I threw together from my research so it is released under Matt Might’s CRAPL license.

## Diagnosis by Google Doesn’t Work

I have often Googled symptoms, visited WebMD, and concluded that I have a deadly disease. At SIGIR 2013, Ryen White’s paper, Beliefs and Biases in Web Search, provided empirical evidence for the poor success rate of diagnosis-by-Google.

The study mined medical yes/no questions (for example, "Can salmonella cause belly-ache?"), had physicians answer them, and then measured user bias post-search: after perusing the results, users answered their original question with a yes or a no. The paper contains a very detailed description of the experiments conducted.

The accuracy of the final answer was the most interesting part of this paper - only about half of the questions were accurately answered. That is as good as flipping a (fair) coin for each question. The rest of the paper was a fairly interesting read (and it won the SIGIR 2013 best paper award).

## Consistent Hashing in Clojure

I wrote this post to teach myself consistent hashing - a simple hash family that Akamai’s founders came up with. This was originally done to prepare for a talk in my grad algorithms class (I made a horlicks of the talk but whatever). I am going to provide intuition, analysis and a Clojure implementation.
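
To give a flavour of the idea before the full post, here is a minimal sketch of a consistent hash ring. It is not the implementation from the post: the names `hash-point`, `add-node`, `remove-node` and `lookup` are made up for illustration, using MD5 as the hash and placing several replica points per node are my assumptions, and hash collisions between points are ignored.

```clojure
(import '(java.security MessageDigest)
        '(java.nio ByteBuffer))

(defn hash-point
  "Maps a string to a non-negative long taken from the first 8 bytes
  of its MD5 digest. MD5 is an assumption; any well-mixed hash works."
  [^String s]
  (let [digest (.digest (MessageDigest/getInstance "MD5")
                        (.getBytes s "UTF-8"))]
    (bit-and (.getLong (ByteBuffer/wrap digest)) Long/MAX_VALUE)))

(defn add-node
  "Places `replicas` points for `node` on the ring (a sorted map of
  point -> node name)."
  [ring node replicas]
  (reduce (fn [r i] (assoc r (hash-point (str node "#" i)) node))
          ring
          (range replicas)))

(defn remove-node
  "Removes the points that `add-node` placed for `node`."
  [ring node replicas]
  (reduce (fn [r i] (dissoc r (hash-point (str node "#" i))))
          ring
          (range replicas)))

(defn lookup
  "Returns the node owning key `k`: the first point at or after the
  key's hash, wrapping around to the start of the ring. Returns nil
  for an empty ring."
  [ring k]
  (when (seq ring)
    (let [[_ node] (or (first (subseq ring >= (hash-point k)))
                       (first ring))]
      node)))

;; Hypothetical node names, 100 points per node.
(def ring
  (reduce #(add-node %1 %2 100)
          (sorted-map)
          ["cache-a" "cache-b" "cache-c"]))

(lookup ring "some-key")
```

The property that makes this worth the trouble: removing a node only reassigns the keys that node owned, and every other key keeps its owner. With a plain `(mod (hash k) n)` scheme, changing `n` would move almost everything.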

## Watch Date Change Mechanism

Someone on r/watches posted this image of a date-change mechanism that illustrates why you shouldn’t change the date close to midnight:

