When building crawlers, most of the effort goes into guiding them through a website. For example, to crawl all pages and individual posts on this blog, we process each page like so:
- Visit the current webpage
- Extract pagination links
- Extract link to each blog post
- Enqueue the extracted links and repeat
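The steps above amount to a breadth-first traversal of the site. Here is a minimal sketch in Python: the `SITE` dictionary is a hypothetical in-memory stand-in for fetching and parsing real pages, and the URL names are invented for illustration.

```python
from collections import deque

# Hypothetical site graph: each URL maps to (pagination links, post links).
# A real crawler would fetch the page and extract these from the HTML.
SITE = {
    "/page/1": (["/page/2"], ["/post/a", "/post/b"]),
    "/page/2": ([], ["/post/c"]),
    "/post/a": ([], []),
    "/post/b": ([], []),
    "/post/c": ([], []),
}

def crawl(start):
    """Visit pages breadth-first: extract pagination and post links, enqueue them."""
    queue, seen, visited = deque([start]), set(), []
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        visited.append(url)               # "visit" the current webpage
        pagination, posts = SITE.get(url, ([], []))
        queue.extend(pagination + posts)  # enqueue all extracted links
    return visited

print(crawl("/page/1"))
```

The DSL described below captures exactly the extraction step (which links to pull from a page); the queueing and deduplication are handled by the crawler itself.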
In this blog post, I present a new DSL that lets you describe this process concisely.
This DSL is now part of this crawler: https://github.com/shriphani/pegasus