This is a post I wrote to teach myself about Heritrix and modifying it. There are solid motivations for modifying web crawlers (say we know how to beat a simple BFS for some specific website). In this post, I will modify a routine that is central to web crawling: extracting URLs from a webpage.
First, I am going to put together a simple extractor in Heritrix. This extractor uses an XPath query (a very trivial one for the sake of this example). I use the HtmlCleaner library to parse the supplied HTML and then the XPath classes that ship with Java. (I have personally found that most HTML parsing libraries bundle partial XPath implementations, and since I typically use more complex queries in my research, I prefer dealing with org.w3c.dom Documents directly.)
This is what the extractor class looks like. It is super simple:
package org.archive.modules.extractor;

import org.archive.modules.CrawlURI;
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

/**
 * Created by shriphani on 3/14/14.
 *
 * Code to take the fetched HTML, convert it to a DOM object, and extract URLs
 * using XPath queries.
 */
public class XPathExtractor extends ExtractorHTML {

    @Override
    protected void extract(CrawlURI curi, CharSequence cs) {
        // Parse the raw HTML, dropping script and style elements
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();
        props.setPruneTags("script,style");
        TagNode node = cleaner.clean(cs.toString());

        try {
            // Convert HtmlCleaner's tree into a standard org.w3c.dom.Document
            Document document = new DomSerializer(new CleanerProperties()).createDOM(node);

            // Run the XPath query (a trivial one here: every anchor element)
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(".//a", document, XPathConstants.NODESET);

            for (int i = 0; i < nodes.getLength(); i++) {
                Node n = nodes.item(i);
                NamedNodeMap attributes = n.getAttributes();
                Node hrefAttr = attributes.getNamedItem("href");
                // Skip anchors that have no href attribute at all
                if (hrefAttr == null) {
                    continue;
                }
                String href = hrefAttr.getTextContent();
                if (href != null && !href.isEmpty()) {
                    // Hand the discovered URL back to Heritrix as an outlink
                    processEmbed(curi, href, elementContext(n.getNodeName(), "href"));
                }
            }
        } catch (ParserConfigurationException e) {
            // ignored
        } catch (XPathExpressionException e) {
            // ignored
        }
    }
}
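If you want to sanity-check the HtmlCleaner-to-DOM-to-XPath plumbing outside of Heritrix, a minimal sketch looks like this (the class name and the inline HTML string are just placeholders for illustration):

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathPipelineDemo {
    public static void main(String[] args) throws Exception {
        String html = "<html><body><a href='http://example.com/a'>a</a>"
                    + "<a href='http://example.com/b'>b</a></body></html>";

        // Parse the (possibly malformed) HTML into HtmlCleaner's tree
        HtmlCleaner cleaner = new HtmlCleaner();

        // Serialize that tree into a standard org.w3c.dom.Document
        Document doc = new DomSerializer(new CleanerProperties()).createDOM(cleaner.clean(html));

        // Run a plain JDK XPath query against the DOM
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hrefs = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);
        for (int i = 0; i < hrefs.getLength(); i++) {
            System.out.println(hrefs.item(i).getTextContent());
        }
    }
}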
Now, to see it in action, you need to create a Heritrix job and specify that this is the extractor you want to use. I have a test job that crawls my blog. A Heritrix job contains a configuration file where you specify the extractors and some other details (seed URLs and so on). In this file, I specified the extractor class like so:
<bean id="extractorHtml" class="org.archive.modules.extractor.XPathExtractor">
(incidentally the entire file looks like this).
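For context, the extractor bean is referenced from the fetch chain in that same configuration file. The relevant portion looks roughly like this (a sketch based on the stock crawler-beans profile, so the exact list of processors in your job may differ):

<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
  <property name="processors">
    <list>
      <ref bean="preselector"/>
      <ref bean="preconditions"/>
      <ref bean="fetchDns"/>
      <ref bean="fetchHttp"/>
      <ref bean="extractorHttp"/>
      <!-- this is the bean whose class we swapped out above -->
      <ref bean="extractorHtml"/>
    </list>
  </property>
</bean>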
I was subsequently able to run the job and process webpages without too much fuss. In the near future, I plan to describe some of the more interesting stuff I’ve been able to do with Heritrix.