Clueweb12++ Status Report
I recently gave a talk at CMU on the state of the Clueweb12++ crawl. Here are the slides.
The Clueweb12++ crawl aims at accumulating social media content from the Clueweb crawl’s time frame. Our pipeline thus far was as follows:
There is one complicated part of this setup: step 2. Date processing is a nuisance that I would not wish upon anyone else. There are countless surface representations (many of them ambiguous) and, to add to our troubles, people do things like write “Last Week” to indicate the time of activity.
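To make the two problems concrete, here is a minimal sketch that uses only the standard `java.time` API. It is not the code from our pipeline. The class name `DateAmbiguityDemo`, the sample string, and the reference date are all made up for illustration, and reading “Last Week” as “exactly seven days before the reference date” is just one possible assumption.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class DateAmbiguityDemo {

    // Parse the same surface string under a caller-supplied pattern.
    static LocalDate parseWith(String text, String pattern) {
        return LocalDate.parse(text, DateTimeFormatter.ofPattern(pattern, Locale.ROOT));
    }

    // Resolve "last week" against a reference date (for example, the date
    // the page was posted or crawled). Assumption: "last week" means
    // exactly one week earlier; other readings are equally defensible.
    static LocalDate resolveLastWeek(LocalDate referenceDate) {
        return referenceDate.minusWeeks(1);
    }

    public static void main(String[] args) {
        // Hypothetical surface form: month-first or day-first?
        String surface = "03/04/2012";
        LocalDate monthFirst = parseWith(surface, "MM/dd/uuuu");
        LocalDate dayFirst = parseWith(surface, "dd/MM/uuuu");
        System.out.println("month-first reading: " + monthFirst);
        System.out.println("day-first reading:   " + dayFirst);

        // A relative expression is meaningless without a reference date.
        LocalDate reference = LocalDate.of(2012, 5, 10); // hypothetical post date
        System.out.println("\"Last Week\" resolved: " + resolveLastWeek(reference));
    }
}
```

The same ten characters yield two dates a month apart, and the relative expression cannot be resolved at all unless we know when the text was written.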
The most accurate tool is SUTime, but on a crawl the size of ClueWeb it is foolish to run such a tool. So we use Natty instead. Natty is fast and reasonably accurate.
I’ve uploaded a Java module to GitHub that will spit out a list of dates. You can obtain it here.
Twitter: @shriphani
Instagram: @life_of_ess
Fortior Per Mentem
(c) Shriphani Palakodety 2013-2020