I recently gave a talk at CMU on the state of the Clueweb12++ crawl. Here are the slides.
The Clueweb12++ crawl aims to collect social media content from the Clueweb crawl’s time frame. So far, our pipeline is as follows:
- Download a bunch of index pages from forums (index pages link to threads).
- Identify posts that fall in the specified time frame.
- Download the posts and recreate the web graph to give the impression of a crawl completed in the 2012 time frame.
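To make the second step concrete, here is a minimal sketch of the time-frame filter, assuming each thread URL has already been paired with a parsed date. The class name `TimeFrameFilter`, the method `urlsWithinTimeFrame`, and the map layout are hypothetical and are not taken from the actual pipeline; the window bounds are parameters because the exact dates are not given here.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class TimeFrameFilter {
    /**
     * Returns the URLs whose date lies in [start, end], both bounds inclusive.
     * The map goes from thread URL to the date extracted for it (hypothetical layout).
     */
    static List<String> urlsWithinTimeFrame(Map<String, LocalDate> datesByUrl,
                                            LocalDate start, LocalDate end) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, LocalDate> entry : datesByUrl.entrySet()) {
            LocalDate postedOn = entry.getValue();
            // A post with no recoverable date is skipped rather than guessed at.
            if (postedOn == null) {
                continue;
            }
            if (!postedOn.isBefore(start) && !postedOn.isAfter(end)) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }
}
```

The filter itself is trivial; everything hard is hidden in producing the `LocalDate` values that go into the map.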
There is one complicated step in this setup: step 2. Date processing is a nuisance that I would not wish upon anyone else. Dates come in countless surface representations, many of them ambiguous, and to add to our troubles, people do stuff like use “Last Week” to indicate time of activity.
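Both problems can be illustrated with the standard `java.time` API alone. The sketch below is not what SUTime or Natty do internally; it is a toy resolver, and the class name `ForumDateResolver`, the list of formats, and the handful of relative phrases are all assumptions chosen for illustration. Relative phrases are resolved against the date the page was fetched, and "Last Week" is approximated as exactly seven days earlier.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

final class ForumDateResolver {
    // Candidate absolute formats, tried in order. The order encodes a guess:
    // "03/04/2012" is read as month/day/year, which is wrong for many non-US forums.
    private static final List<DateTimeFormatter> FORMATS = List.of(
            DateTimeFormatter.ofPattern("yyyy-MM-dd", Locale.ENGLISH),
            DateTimeFormatter.ofPattern("MM/dd/yyyy", Locale.ENGLISH),
            DateTimeFormatter.ofPattern("MMM d, yyyy", Locale.ENGLISH));

    /** Resolves a raw date string, using fetchedOn as the anchor for relative phrases. */
    static Optional<LocalDate> resolve(String raw, LocalDate fetchedOn) {
        String text = raw.trim();
        switch (text.toLowerCase(Locale.ENGLISH)) {
            case "today":
                return Optional.of(fetchedOn);
            case "yesterday":
                return Optional.of(fetchedOn.minusDays(1));
            case "last week":
                // Approximation: the phrase only narrows the date to a week.
                return Optional.of(fetchedOn.minusWeeks(1));
            default:
                break;
        }
        for (DateTimeFormatter format : FORMATS) {
            try {
                return Optional.of(LocalDate.parse(text, format));
            } catch (DateTimeParseException e) {
                // Not this format; fall through to the next candidate.
            }
        }
        return Optional.empty();
    }
}
```

Note that a "Last Week" seen on a page fetched in 2013 resolves to 2013, so the fetch date matters as much as the string itself.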
The most accurate tool is SUTime, but it would be foolish to run it on a crawl the size of ClueWeb. So we use Natty instead, which is fast and reasonably accurate.
I’ve uploaded a Java module to GitHub that will spit out a list of dates. You can obtain it here.
(c) Shriphani Palakodety 2013-2018