When building crawlers, most of the effort goes into guiding them through a website. For example, if we want to crawl all pages and individual posts on this blog, we extract links like so (a rough sketch follows the list):
- Visit current webpage
- Extract pagination links
- Extract link to each blog post
- Enqueue extracted links
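Hand-rolled, these steps look something like the sketch below. This is a minimal illustration, not pegasus itself, and the CSS selectors (`.pagination a`, `.post-title a`) are hypothetical placeholders for whatever markup the target blog actually uses.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def extract_links(url, queue):
    # Visit the current webpage.
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Extract pagination links and links to individual posts.
    # The selectors are assumptions; swap in the target site's markup.
    for selector in (".pagination a", ".post-title a"):
        for anchor in soup.select(selector):
            href = anchor.get("href")
            if href:
                # Enqueue the extracted link as an absolute URL.
                queue.append(urljoin(url, href))
```

Every site needs some variation of this boilerplate, which is exactly what a declarative description can compress.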
In this blog post, I present a new DSL that allows you to concisely describe this process.
This DSL is now part of this crawler: https://github.com/shriphani/pegasus
This is a post I wrote to teach myself about Heritrix and how to modify it. There are solid motivations for modifying web crawlers (say, we know how to beat a simple BFS on a specific website). In this post, I will modify a routine that is central to web crawling: extracting URLs from a webpage.
For my SIGIR submission, I have been working on efficient traversal strategies for crawling websites.
Web crawling is a straightforward graph-traversal problem. My research focuses on discarding unproductive paths, preserving bandwidth so the crawler finds more information. I will write a post on it once my ideas are fleshed out, so it won't be the focus of this post.
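As a sketch of that framing: a crawler is a breadth-first traversal of the link graph, and a smarter strategy amounts to a predicate that prunes the frontier. Everything below is hypothetical scaffolding; in particular, `is_promising` stands in for whatever scoring a real strategy would use.

```python
from collections import deque

def crawl(seed, fetch_links, is_promising, limit=1000):
    """Breadth-first traversal of the link graph.

    fetch_links(url) -> iterable of outlinks on that page.
    is_promising(url) -> hypothetical hook that prunes
    unproductive paths before they consume bandwidth.
    """
    seen = {seed}
    frontier = deque([seed])
    while frontier and len(seen) < limit:
        url = frontier.popleft()
        for link in fetch_links(url):
            if link not in seen and is_promising(link):
                seen.add(link)
                frontier.append(link)
    return seen
```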
Here, I will describe the finer details needed to make your crawler polite and robust. An impolite crawler will incur the wrath of an admin and might get you banned. A crawler that isn't robust cannot survive the onslaught of quirks the WWW throws at it.
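To make both properties concrete, here is a minimal sketch using requests and the standard library: it honors robots.txt and rate-limits itself (politeness), and it bounds waits and absorbs network errors instead of dying on the first bad page (robustness). The user-agent string and the two-second delay are assumptions; tune them per site, and honor any Crawl-delay the site specifies.

```python
import time
import requests
from urllib import robotparser
from urllib.parse import urljoin

AGENT = "my-crawler/0.1"  # hypothetical user-agent; identify yourself honestly
DELAY = 2.0               # seconds between requests; an assumption, tune per site

_robots_cache = {}

def polite_fetch(url):
    # Politeness: consult robots.txt before fetching.
    robots_url = urljoin(url, "/robots.txt")
    rp = _robots_cache.get(robots_url)
    if rp is None:
        rp = robotparser.RobotFileParser(robots_url)
        try:
            rp.read()
        except OSError:
            pass  # robots.txt unreachable; the unread parser conservatively disallows
        _robots_cache[robots_url] = rp
    if not rp.can_fetch(AGENT, url):
        return None

    # Robustness: bound the wait and survive network quirks.
    try:
        resp = requests.get(url, headers={"User-Agent": AGENT}, timeout=10)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException:
        return None
    finally:
        time.sleep(DELAY)  # politeness: rate-limit ourselves
```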