Augmenting enlive

When manipulating HTML documents to extract features, I find myself needing the same operations all the time: removing script tags, comments and the like. This feature set is available in HtmlCleaner, so I merged the two libraries to produce enlive-helper.

Now you can do:

(html-resource-steroids 
 (java.io.StringReader. "<html><body><a>hi</a></body></html>") 
 :prune-tags "a")

As a result, the <a> tag is not picked up:

({:tag :html,
  :attrs nil,
  :content
  ("\n"
   {:tag :head, :attrs nil, :content nil}
   "\n"
   {:tag :body, :attrs nil, :content nil})})

The options you can pass mirror those of HtmlCleaner. Full docs are available in this GitHub repo.
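To make the effect of :prune-tags concrete, here is a minimal sketch in plain Clojure that drops elements by tag from an enlive-style node tree. The name prune-tags and its signature are assumptions made for illustration; enlive-helper presumably hands this work to HtmlCleaner, so this is not the library's implementation.

```clojure
(defn prune-tags
  "Returns `node` with every descendant element whose :tag is in `tags`
  removed. Strings (text nodes) pass through unchanged."
  [tags node]
  (if (map? node)
    (let [kept (->> (:content node)
                    (remove #(and (map? %) (contains? tags (:tag %))))
                    (map #(prune-tags tags %)))]
      ;; seq turns an empty child list into nil, like the output above
      (assoc node :content (seq kept)))
    node))

;; Hypothetical input tree, shaped like enlive's parsed output
(def doc
  {:tag :html, :attrs nil,
   :content [{:tag :head, :attrs nil, :content nil}
             {:tag :body, :attrs nil,
              :content [{:tag :a, :attrs nil, :content ["hi"]}]}]})

(prune-tags #{:a} doc)
```

After pruning, the body element is left with :content nil, which matches the shape of the result shown above.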

Also, the code is something I threw together from my research so it is released under Matt Might’s CRAPL license.


Diagnosis by Google Doesn’t Work

I have often Googled symptoms, visited WebMD, and concluded that I have a deadly disease. At SIGIR 2013, Ryen White’s paper, Beliefs and Biases in Web Search, provided empirical evidence for the poor success rate of diagnosis-by-Google.

The authors mined medical yes/no questions (for example: can salmonella cause belly-ache?), had physicians answer them, and then measured user bias post-search: after perusing the results, users answered their original questions with yes or no. The paper contains a very detailed description of the experiments conducted.

The accuracy of the final answer was the most interesting part of this paper - only about half of the questions were accurately answered. That is as good as flipping a (fair) coin for each question. The rest of the paper was a fairly interesting read (and it won the SIGIR 2013 best paper award).


Consistent Hashing in Clojure

I wrote this post to teach myself consistent hashing, a simple hash family that Akamai’s founders came up with. This was originally done to prepare for a talk in my grad algorithms class (I made a horlicks of the talk but whatever). I am going to provide intuition, analysis and a Clojure implementation.
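As a taste of the idea, here is a minimal sketch of a hash ring, not the implementation from the full post. The function names, the use of virtual points ("replicas"), and the choice of Clojure's built-in hash are all assumptions made for illustration; a real system would likely pick a stronger hash function.

```clojure
(defn add-node
  "Places `replicas` virtual points for `node` on the ring (a sorted map
  from hash value to node name)."
  [ring node replicas]
  (reduce (fn [r i] (assoc r (hash (str node "#" i)) node))
          ring
          (range replicas)))

(defn remove-node
  "Drops every point on the ring that belongs to `node`."
  [ring node]
  (reduce-kv (fn [r k v] (if (= v node) (dissoc r k) r)) ring ring))

(defn lookup
  "Finds the node owning `k`: the first point at or after (hash k),
  wrapping around to the smallest point on the ring."
  [ring k]
  (when (seq ring)
    (val (or (first (subseq ring >= (hash k)))
             (first ring)))))

;; A ring with three hypothetical nodes, 100 virtual points each
(def ring
  (reduce #(add-node %1 %2 100) (sorted-map) ["a" "b" "c"]))

(lookup ring "some-key")
```

The property that makes this useful: removing a node only moves the keys that node owned. Every other key still finds the same point on the ring, so its owner does not change.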


Fertility Rates and Prosperity

Singapore’s government (with Mentos involved in this awkward project) used this ad during its National Day celebrations to encourage people to copulate and increase Singapore’s birth rate. I hypothesized that a low fertility rate (number of children per woman) was not unique to Singapore (though the problem might be more acute there).

Here’s a plot of GDP vs. fertility rates for all nations. The red dots are the developed countries:

Clearly the poorest nations have a very high fertility rate. Almost all of the developed economies lie below the replacement fertility rate of 2.1.

Here’s a plot of just the developed economies:

And a similar plot for developed Asian economies:

Full source code and datasets are available in this GitHub repo.


Clojure/Java String trim

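Java’s String.trim only strips characters up to U+0020 (space and control characters), and Clojure’s clojure.string/trim tests for whitespace using Character/isWhitespace, which does not count non-breaking spaces such as U+00A0.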

While processing a dataset off the web containing Unicode space characters, the trim routine failed to do anything useful. Luckily, a StackOverflow thread suggested using a routine from Google’s Guava library. So in Clojure, you can do this:

(import 'com.google.common.base.CharMatcher)
#(.trimFrom CharMatcher/WHITESPACE %) ; % is the string to trim

and this will do the job.
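If pulling in Guava is not an option, the same effect can be approximated with the JDK alone. The sketch below is an assumption-laden stand-in, not Guava's definition of whitespace: the names unicode-space? and trim-unicode are made up here, and the predicate simply combines Character/isWhitespace with Character/isSpaceChar, which does cover the no-break space U+00A0.

```clojure
(require '[clojure.string :as string])

(defn unicode-space?
  "True for anything Java calls whitespace, plus Unicode space
  separators such as the no-break space U+00A0."
  [c]
  (or (Character/isWhitespace c) (Character/isSpaceChar c)))

(defn trim-unicode
  "Strips leading and trailing characters matching unicode-space?."
  [s]
  (let [tail (drop-while unicode-space? s)
        kept (reverse (drop-while unicode-space? (reverse tail)))]
    (apply str kept)))

;; Hypothetical input padded with no-break spaces
(def padded "\u00A0 hello world \u00A0")

(string/trim padded)  ; leaves the no-break spaces in place
(trim-unicode padded) ; strips them
```

This version walks the string as a sequence of characters, which is fine for short strings but slower than an index-based loop on large inputs.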


The TW-IDF Model

I found this recent paper on ad hoc IR from CIKM 2013. I haven’t worked in this area, but the results seem promising. Essentially, instead of TF, you store an indegree count. The graph in question is the term co-occurrence graph within a window, with edge direction indicating word order. The paper won an honorable mention at CIKM, so it is clearly very cool.

Link to paper
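Here is a rough sketch of the indegree idea as I read it from the summary above; the paper's exact definitions may differ. The assumptions: an edge runs from an earlier term to a later term when both fall within `window` consecutive tokens, edges are unweighted and counted once, and the function names are made up for illustration.

```clojure
(defn cooccurrence-edges
  "Distinct directed edges [earlier later] between different terms that
  appear within `window` consecutive tokens of each other."
  [tokens window]
  (let [tokens (vec tokens)
        n (count tokens)]
    (set (for [i (range n)
               j (range (inc i) (min n (+ i window)))
               :let [a (tokens i), b (tokens j)]
               :when (not= a b)]
           [a b]))))

(defn term-weights
  "Maps each term to its indegree in the co-occurrence graph; this count
  stands in for raw term frequency. Terms with no incoming edge are
  absent from the result."
  [tokens window]
  (frequencies (map second (cooccurrence-edges tokens window))))

(term-weights ["a" "b" "c" "a"] 3)
;; => {"b" 1, "c" 2, "a" 2} (key order may vary)
```

In the example, "a" occurs twice but is weighted by the two distinct terms that precede it within the window, not by its raw count.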


Modifying The Heritrix Web Crawler

This is a post I wrote to teach myself about Heritrix and modifying it. There are solid motivations for modifying web-crawlers (say we know how to beat a simple BFS for some specific website). In this post, I will modify a routine that is central to web-crawling - extracting URLs from a webpage.
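To give a feel for what a link-extraction routine does, here is a toy sketch in Clojure. It is not Heritrix code (Heritrix is written in Java and its extractors are far more thorough); the function name and the regex-based approach are assumptions made for illustration, and a regex is only a rough stand-in for a real HTML parser.

```clojure
(import 'java.net.URI)

(defn extract-links
  "Returns the distinct absolute URLs for every href found in `html`,
  resolved against `base-url`. Fragments (#...) are cut off by the regex.
  Malformed hrefs (e.g. containing spaces) will make URI resolution throw."
  [base-url html]
  (let [base (URI. base-url)]
    (->> (re-seq #"(?i)href\s*=\s*[\"']([^\"'#]+)" html)
         (map second)
         (map (fn [href] (str (.resolve base href))))
         distinct)))

(extract-links
 "http://example.com/index.html"
 "<a href=\"/about\">About</a> <A HREF='http://example.org/x'>x</A>")
;; => ("http://example.com/about" "http://example.org/x")
```

A crawler would feed each extracted URL back into its frontier; changing which links get extracted, or in what order, is one way to beat a plain BFS on a specific site.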


Clojure Blekko API

Back when Blekko had an API, I threw together this implementation of their API for a small project I was working on. It only supports querying Blekko (the API also allows slashtag manipulation and the like, which I didn’t bother with).

Anyway, you can do this:

Usage

With leiningen:

[clj_blekko "0.1.0"]

With maven:

<dependency>
  <groupId>clj_blekko</groupId>
  <artifactId>clj_blekko</artifactId>
  <version>0.1.0</version>
</dependency>

Then you can run a query:

user> (use 'clj-blekko.core :reload)

user> (clojure.pprint/pprint (:RESULT (run-query "shriphani" api-key :json true :page 2)))
[{:c 1,
  :display_url
  "kindle.amazon.com/profile/Werner-Vogels/160/followers/65",
  :n_group 41,
  :short_host "kindle.amazon.com",
  :short_host_url "http://kindle.amazon.com/",
  :snippet
  "1 Followers 0 Books with Public Notes.  Peter van der Reijden.  5 Followers 0 Books with Public Notes.  4 Followers 0 Books with Public Notes.  0 Followers 0 Books with Public Notes.",
  :url
  "https://kindle.amazon.com/profile/Werner-Vogels/160/followers/65",
  :url_title "Amazon Kindle - Werner Vogels - Followers"}
 {:c 2,
  :display_url "yelp.com/biz/M8P96GmGImU0KOrFMfVCKw",
  :n_group 42,
  :short_host "yelp.com",
  :short_host_url "http://www.yelp.com/",
  :snippet
  "33 reviews for Blue Nile Restaurant.  Blue Nile Restaurant.  117 Northwestern Ave Ste 2 West Lafayette, IN 47906.  Mon-Sat 11 am - 11 pm.  Sun 12 pm - 10 pm.",
  :url "http://www.yelp.com/biz/M8P96GmGImU0KOrFMfVCKw",
  :url_title
  "<strong>Yelp.com</strong> - Blue Nile Restaurant - West Lafayette, IN"}
 {:c 3,
  :display_url "yelp.com/biz/blue-nile-restaurant-west-lafayette",
  :n_group 43,
  :short_host "yelp.com",
  :short_host_url "http://www.yelp.com/",
  :snippet
  "34 reviews for Blue Nile Restaurant.  Blue Nile Restaurant.  117 Northwestern Ave Ste 2 West Lafayette, IN 47906.  Mon-Sat 11 am - 11 pm.  Sun 12 pm - 10 pm.",
  :url "http://www.yelp.com/biz/blue-nile-restaurant-west-lafayette",
  :url_title
  "<strong>Yelp.com</strong> - Blue Nile Restaurant - West Lafayette, IN"}
 {:c 4,
  :display_url "amazon.com/gp/product/0679776222?link_code=as3",
  :n_group 44,
  :short_host "amazon.com",
  :short_host_url "http://www.amazon.com/",
  :snippet
  "Frequently Bought Together.  Customers Who Bought This Item Also Bought.  More About the Authors.  Very Bad Poetry Paperback.  Very Bad Poetry (Vintage) and over one million other books are available for Amazon Kindle.",
  :url "http://www.amazon.com/gp/product/0679776222?link_code=as3",
  :url_title
  "<strong>Amazon.com</strong> - Very Bad Poetry - Ross Petras, Kathryn Petras - 9780679776222 - Amazon.com - Books"}
 {:c 5,
  :display_url "yelp.com/biz/shaukin-indian-fast-food-west-lafayette",
  :n_group 45,
  :short_host "yelp.com",
  :short_host_url "http://www.yelp.com/",
  :snippet
  "23 reviews for Shaukin Indian Fast Food.  138 S River Rd West Lafayette, IN 47906.  Tue-Thu 4 pm - 10 pm.  Fri-Sun 12 pm - 10 pm.  Good for Kids.",
  :url
  "http://www.yelp.com/biz/shaukin-indian-fast-food-west-lafayette",
  :url_title
  "<strong>Yelp.com</strong> - Shaukin Indian Fast Food - West Lafayette, IN"}
 {:c 6,
  :display_url
  "reddit.com/.../who_here_doesnt_sympathize_with_g20_rioters",
  :n_group 46,
  :short_host "reddit.com",
  :short_host_url "http://www.reddit.com/",
  :snippet
  "Login or register in seconds.  Limit my search to /r/worldnews.  Use the following search parameters to narrow your results.  Search for &quot;text&quot; in url.  Search for &quot;text&quot; in self post contents.",
  :url
  "http://www.reddit.com/r/worldnews/comments/cjhse/who_here_doesnt_sympathize_with_g20_rioters/",
  :url_title
  "<strong>Too Many Requests</strong> - Who here doesn&#39;t sympathize with G20 rioters destroying people&#39;s property - worldnews"}
 {:c 7,
  :display_url "yelp.ie/biz/shaukin-indian-fast-food-west-lafayette",
  :n_group 47,
  :short_host "yelp.ie",
  :short_host_url "http://www.yelp.ie/",
  :snippet
  "Recommended Reviews for Shaukin Indian Fast Food.  Cookies help us deliver our services.  By using our services, you agree to our use of cookies.  138 S River Rd West Lafayette, IN 47906.  Good for Children.",
  :url
  "http://www.yelp.ie/biz/shaukin-indian-fast-food-west-lafayette",
  :url_title "Shaukin Indian Fast Food - West Lafayette, IN"}
 {:c 8,
  :display_url
  "en.yelp.be/biz/shaukin-indian-fast-food-west-lafayette",
  :n_group 48,
  :short_host "en.yelp.be",
  :short_host_url "http://en.yelp.be/",
  :snippet
  "25 reviews for Shaukin Indian Fast Food.  138 S River Rd West Lafayette, IN 47906.  Good for Children.  Accepts Credit Cards.  Good for Groups.",
  :url "http://en.yelp.be/biz/shaukin-indian-fast-food-west-lafayette",
  :url_title "Shaukin Indian Fast Food - West Lafayette, IN"}
 {:c 9,
  :display_url "quora.com/Colin-Ho",
  :n_group 49,
  :short_host "quora.com",
  :short_host_url "http://www.quora.com/",
  :snippet
  "You must sign in to read Quora past the first answer.  Login to Quora.  Complete your account on Quora.  There are some updates to this page that haven&#39;t been applied yet because you&#39;ve entered some data into a form.  Refresh this page to receive new updates.",
  :url "http://www.quora.com/Colin-Ho",
  :url_title "Colin Ho - <strong>Quora</strong>"}
 {:c 10,
  :display_url
  "quora.com/...change-the-world-the-most-within-the-next-25-years",
  :n_group 50,
  :short_host "quora.com",
  :short_host_url "http://www.quora.com/",
  :snippet
  "You must sign in to read past the first answer.  Complete Your Profile.  Login to Quora.  You must sign in to read all of Quora.  You must be signed in to read this answer.",
  :url
  "http://www.quora.com/Which-technological-innovation-will-change-the-world-the-most-within-the-next-25-years",
  :url_title
  "<strong>Quora</strong> - Which technological innovation will change the world the most within the next 25 years - Quora"}
 {:display_url "meetup.com/Clojure-PGH",
  :short_host "meetup.com",
  :c 11,
  :url_title
  "<strong>Meetup</strong> - Pittsburgh Clojure Users Group (Pittsburgh, PA) - Meetup",
  :n_group 51,
  :doc_date "Mar 2010",
  :url "http://www.meetup.com/Clojure-PGH/",
  :short_host_url "http://www.meetup.com/",
  :snippet
  "Welcome old lispers and new schemers.  Come to our next event to meet other programmers interested in the latest secret alien technology.  We are building a community of people who want to learn from and teach others about Clojure.  Talking with printed notes is encouraged, powerpoints are forbidden.",
  :doc_date_iso "2010-03-04 00:00:00"}
 {:c 12,
  :display_url
  "quora.com/.../How-do-you-ensure-that-TAs-for-introductory-CS...",
  :n_group 52,
  :short_host "quora.com",
  :short_host_url "http://www.quora.com/",
  :snippet
  "You must sign in to read past the first answer.  Complete Your Profile.  Login to Quora.  You must sign in to read all of Quora.  You must be signed in to read this answer.",
  :url
  "http://www.quora.com/Computer-Science-Education/How-do-you-ensure-that-TAs-for-introductory-CS-classes-teach-at-a-high-quality",
  :url_title
  "<strong>Quora</strong> - Computer Science Education - How do you ensure that TAs for introductory CS classes teach at ... "}
 {:c 13,
  :display_url
  "quora.com/.../Is-it-already-too-late-to-get-on-the-wave-of-f...",
  :n_group 53,
  :short_host "quora.com",
  :short_host_url "http://www.quora.com/",
  :snippet
  "If I were to start learning Scala and the functional programming paradigm (coming from imperative languages), am I getting early or late to the party.  I&#39;m a functional learner myself.",
  :url
  "http://www.quora.com/Scala/Is-it-already-too-late-to-get-on-the-wave-of-functional-programming-and-Scala",
  :url_title
  "<strong>Quora</strong> - Is it already too late to get on the wave of functional programming and Scala"}
 {:c 14,
  :display_url "python.org/doc/3.0.1/about.html",
  :n_group 54,
  :short_host "python.org",
  :short_host_url "http://www.python.org/",
  :snippet
  "Contributors to the Python Documentation.  About these documents.  These documents are generated from reStructuredText sources by Sphinx, a document processor specifically written for the Python documentation.",
  :url "http://www.python.org/doc/3.0.1/about.html",
  :url_title
  "<strong>Python Programming Language</strong> - About these documents — Python v3.0.1 documentation"}]

Anyway, thought it might be useful (even though blekko’s gone and nuked free access to their API).



Per Intellectum, Vis
(c) Shriphani Palakodety 2013-2016