When building crawlers, most of the effort is expended in guiding them through a website. For example, if we want to crawl all pages and individual posts on this blog, we extract links like so (a hand-written sketch of this loop follows the list):
- Visit current webpage
- Extract pagination links
- Extract link to each blog post
- Enqueue extracted links
- Continue
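Written by hand, that loop looks roughly like the sketch below. Note that `fetch` and `extract-links` are hypothetical helpers standing in for an HTTP client and a link extractor; they are not part of pegasus.

```clojure
;; A rough sketch of the hand-written version of the loop above,
;; to show the boilerplate the DSL removes.
(defn crawl-by-hand
  [seed fetch extract-links]
  (loop [queue (conj clojure.lang.PersistentQueue/EMPTY seed)
         seen  #{seed}]
    (when-let [url (peek queue)]
      (let [body  (fetch url)                          ; visit the current page
            links (remove seen (extract-links body))]  ; pagination + post links
        (recur (into (pop queue) links)                ; enqueue what we found
               (into seen links))))))                  ; and continue
```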
In this blog post, I present a new DSL that allows you to concisely describe this process.
This DSL is now part of this crawler: https://github.com/shriphani/pegasus
Enlive provides an idiomatic way to select elements, and it forms the foundation of most of the work here. Let us implement the crawler discussed above.
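As a quick refresher, here is the kind of selection Enlive gives us, assuming Enlive is on the classpath; the selector mirrors the one the DSL uses below:

```clojure
(require '[net.cgrand.enlive-html :as html])

;; Parse the blog's front page and grab the <a> tag inside each post header.
(def post-link-nodes
  (html/select (html/html-resource (java.net.URL. "http://blog.shriphani.com"))
               [:article :header :h2 :a]))
```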
Since the focus (in this blog post at least) is on extracting links, let us look at the DSL:
```clojure
(defextractors
  (extract :at-selector [:article :header :h2 :a]
           :follow :href
           :with-regex #"blog.shriphani.com")

  (extract :at-selector [:ul.pagination :a]
           :follow :href
           :with-regex #"blog.shriphani.com"))
```
And that is it! We specified an Enlive selector to pull tags, the attribute whose value to follow, and a regex to filter the extracted URLs.
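Conceptually, each extract clause behaves like the sketch below: select nodes, read one attribute off each, and keep the values that match the regex. This is only my reading of the semantics, not pegasus's actual implementation:

```clojure
(defn extract-links-sketch
  [nodes follow with-regex]
  (->> nodes                                       ; nodes matched by :at-selector
       (map #(get-in % [:attrs follow]))           ; follow the given attribute
       (filter #(and % (re-find with-regex %)))))  ; keep URLs matching the regex

;; e.g. (extract-links-sketch post-link-nodes :href #"blog.shriphani.com")
```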
With pegasus, the full crawler is expressed as:
```clojure
(defn crawl-sp-blog-custom-extractor
  []
  (crawl {:seeds ["http://blog.shriphani.com"]
          :user-agent "Pegasus web crawler"
          :extractor (defextractors
                       (extract :at-selector [:article :header :h2 :a]
                                :follow :href
                                :with-regex #"blog.shriphani.com")

                       (extract :at-selector [:ul.pagination :a]
                                :follow :href
                                :with-regex #"blog.shriphani.com"))
          :corpus-size 20 ;; crawl 20 documents
          :job-dir "/tmp/sp-blog-corpus"}))
```
Essentially, under 20 lines.
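Running it is just a function call from the REPL (assuming crawl and the DSL macros are referred in from pegasus); the crawled documents end up under the :job-dir given above:

```clojure
;; Kick off the crawl; output is written to "/tmp/sp-blog-corpus".
(crawl-sp-blog-custom-extractor)
```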
The DSL was partially inspired by the great work done in this crawler for Node.js: https://github.com/rchipka/node-osmosis
Links: