Posts tagged 'crawling'

A Clojure DSL for Web-Crawling

2016-11-16

clojure, web-crawling, dsl, crawling, scraping

When building crawlers, most of the effort is expended in guiding them through a website. For example, if we want to crawl all pages and individual posts on this blog, we extract links like so:

Visit current webpage
Extract pagination links
Extract link to each blog post
Enqueue extracted links
Continue

In this blog post, I present a new DSL that allows you to concisely describe this process.

This DSL is now part of this crawler: https://github.com/shriphani/pegasus

Modifying The Heritrix Web Crawler

2014-03-13

java, heritrix, crawling, web-crawling

This is a post I wrote to teach myself about Heritrix and modifying it. There are solid motivations for modifying web-crawlers (say we know how to beat a simple BFS for some specific website). In this post, I will modify a routine that is central to web-crawling - extracting URLs from a webpage.

Web Crawling - Dos and Don’ts

2013-10-15

research, itsy, crawling

For my SIGIR submission I have been working on finding efficient traversal strategies while crawling websites.

Web crawling is a straightforward graph-traversal problem. My research focuses on discarding unproductive paths and preserving bandwidth to find more information. I will write a post on it once I have my ideas fleshed out and thus that won’t be the focus of this post.

Here, I will describe the finer details needed to make your crawler polite and robust. An impolite crawler will incur the wrath of an admin and might get you banned. A crawler that isn’t robust cannot survive the onslaught of quirks that the WWW is full of.