A Clojure DSL for Web-Crawling

When building crawlers, most of the effort is expended in guiding them through a website. For example, if we want to crawl all pages and individual posts on this blog, we extract links like so:

  1. Visit current webpage
  2. Extract pagination links
  3. Extract link to each blog post
  4. Enqueue extracted links
  5. Continue

In this blog post, I present a new DSL that allows you to concisely describe this process.

This DSL is now part of this crawler: https://github.com/shriphani/pegasus

Modifying The Heritrix Web Crawler

This is a post I wrote to teach myself about Heritrix and modifying it. There are solid motivations for modifying web-crawlers (say we know how to beat a simple BFS for some specific website). In this post, I will modify a routine that is central to web-crawling - extracting URLs from a webpage.

Web Crawling - Dos and Don’ts

For my SIGIR submission I have been working on finding efficient traversal strategies while crawling websites.

Web crawling is a straightforward graph-traversal problem. My research focuses on discarding unproductive paths and preserving bandwidth to find more information. I will write a post on it once I have my ideas fleshed out and thus that won’t be the focus of this post.

Here, I will describe the finer details needed to make your crawler polite and robust. An impolite crawler will incur the wrath of an admin and might get you banned. A crawler that isn’t robust cannot survive the onslaught of quirks that the WWW is full of.

Per Intellectum, Vis
(c) Shriphani Palakodety 2013-2016