Posts tagged 'scraping'

A Clojure DSL for Web-Crawling

2016-11-16

clojure, web-crawling, dsl, crawling, scraping

When building crawlers, most of the effort goes into guiding them through a website. For example, to crawl all the pages and individual posts on this blog, the crawler repeats these steps:

Visit current webpage
Extract pagination links
Extract link to each blog post
Enqueue extracted links
Continue
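The steps above can be sketched as a small breadth-first loop in plain Clojure. This is not the DSL itself and not the pegasus API; it is a minimal illustration in which the in-memory `site` map stands in for real HTTP fetches, and the names `site`, `extract-links`, and `crawl` are hypothetical.

```clojure
;; Hypothetical in-memory "site": URL -> links found on that page.
;; A real crawler would fetch and parse HTML here instead.
(def site
  {"/"       {:pagination ["/page/2"] :posts ["/post/a"]}
   "/page/2" {:pagination []          :posts ["/post/b" "/post/a"]}
   "/post/a" {}
   "/post/b" {}})

(defn extract-links
  "Pagination links first, then links to individual posts."
  [page]
  (concat (:pagination page) (:posts page)))

(defn crawl
  "Breadth-first crawl starting at seed. fetch maps a URL to a page.
  Returns the URLs in the order they were visited."
  [fetch seed]
  (loop [queue   (conj clojure.lang.PersistentQueue/EMPTY seed)
         visited []
         seen    #{seed}]
    (if-let [url (peek queue)]
      ;; Visit the page, extract its links, and enqueue the unseen ones.
      (let [links (remove seen (distinct (extract-links (fetch url))))]
        (recur (into (pop queue) links)
               (conj visited url)
               (into seen links)))
      ;; Queue exhausted: the crawl is done.
      visited)))

(crawl site "/")
;; => ["/" "/page/2" "/post/a" "/post/b"]
```

Passing `fetch` as an argument keeps the loop independent of how pages are retrieved, so the same skeleton works with a map in tests and an HTTP client in practice.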

In this blog post, I present a new DSL that allows you to concisely describe this process.

The DSL is now part of the pegasus crawler: https://github.com/shriphani/pegasus

Augmenting enlive

2014-05-16

clojure, enlive, htmlcleaner, scraping

When extracting features from HTML documents, I keep needing the same cleanup operations: removing script tags, comments, and the like. HtmlCleaner already provides these, so I combined it with enlive to produce enlive-helper.

Now you can do:

(html-resource-steroids 
 (java.io.StringReader. "<html><body><a>hi</a></body></html>") 
 :prune-tags "a")

As a result, the a tag is dropped from the parsed output:

({:tag :html,
  :attrs nil,
  :content
  ("\n"
   {:tag :head, :attrs nil, :content nil}
   "\n"
   {:tag :body, :attrs nil, :content nil})})

The options you can pass mirror those of HtmlCleaner. Full documentation is available in the project's GitHub repo.
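To make the effect of pruning concrete, here is a minimal sketch in plain Clojure that removes elements by tag from an enlive-style node tree. It is not how enlive-helper works internally (that delegates to HtmlCleaner); the function name `prune-tags` and the sample `doc` are assumptions made for illustration only.

```clojure
(defn prune-tags
  "Remove every element whose :tag is in the set tags from an
  enlive-style node tree. Text nodes (strings) are left untouched."
  [tags node]
  (if (map? node)
    (update node :content
            (fn [children]
              (->> children
                   (remove #(and (map? %) (contains? tags (:tag %))))
                   (map #(prune-tags tags %))
                   seq)))   ; seq turns an empty child list into nil
    node))

;; Hand-written tree resembling what enlive produces for
;; <html><head></head><body><a>hi</a></body></html>
(def doc
  {:tag :html, :attrs nil,
   :content [{:tag :head, :attrs nil, :content nil}
             {:tag :body, :attrs nil,
              :content [{:tag :a, :attrs nil, :content ["hi"]}]}]})

(prune-tags #{:a} doc)
;; The :body node comes back with :content nil.
```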

Also, the code is something I threw together from my research, so it is released under Matt Might’s CRAPL license.

Clojure scraping overview

2014-02-14

clojure, scraping, clj-xpath, enlive

Earlier this week I gave a talk at the Pittsburgh Clojure meetup group on scraping with Clojure, primarily using clj-xpath and enlive. Slides and code are linked below.

Code: https://github.com/shriphani/clojure_scraping_overview