Augmenting enlive

When manipulating HTML documents to extract features, I find myself needing the same operations all the time: removing script tags, comments and the like. HtmlCleaner already provides this feature set, so I merged the two libraries to produce enlive-helper.

Now you can do:

 (html-resource-steroids
   (java.io.StringReader. "hi")
   :prune-tags "a")

And as a result the a tag is not picked up:

 ({:tag :html,
   :attrs nil,
   :content
   ("\n"
    {:tag :head, :attrs nil, :content nil}
    "\n"
    {:tag :body, :attrs nil, :content nil})})

The options you can pass mirror those of HtmlCleaner. Full docs are available in this GitHub repo.
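For comparison, here is a minimal sketch of the kind of pruning this saves you from writing by hand, using stock enlive only. It is not how enlive-helper is implemented (that goes through HtmlCleaner). The input string and the names sample-html, nodes and pruned are made up for illustration, and it assumes enlive is on the classpath.

```clojure
(require '[net.cgrand.enlive-html :as html])

;; Hypothetical input, chosen only for this illustration.
(def sample-html
  (str "<html><head><script>alert(1)</script></head>"
       "<body><a href=\"#\">hi</a><p>kept</p></body></html>"))

;; Parse with plain enlive; html-resource accepts a java.io.Reader.
(def nodes
  (html/html-resource (java.io.StringReader. sample-html)))

;; In enlive, a nil transformation removes every node the selector matches.
(def pruned
  (html/at nodes
    [:script] nil
    [:a]      nil))

;; The pruned tree no longer contains a or script nodes,
;; while the p element is still there.
(html/select pruned [:a])
(html/select pruned [:script])
(map html/text (html/select pruned [:p]))
```

The difference is that here the tags are dropped after parsing, one selector at a time, whereas :prune-tags asks the cleaner to drop them while the document is being read.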

Also, the code is something I threw together from my research so it is released under Matt Might’s CRAPL license.

Clojure scraping overview

Earlier this week I gave a talk on scraping with Clojure, primarily using clj-xpath and enlive, at the Pittsburgh Clojure meetup group. Slides and code are linked below.

Tree Edit Distance Enlive Version

In my last post, I presented a code dump that computed a restricted version of the tree edit distance algorithm. I was able to achieve a decent speed-up using enlive. Here's the enlive version: