Full description

The RepLab 2013 dataset uses Twitter data in English and Spanish (more than 142,000 tweets). The balance between the two languages depends on the availability of data for each of the entities included in the dataset. The corpus consists of a collection of tweets referring to a selected set of 61 entities from four domains: automotive, banking, universities and music/artists. The domains were selected to offer a variety of scenarios for reputation studies. Crawling was performed from 1 June 2012 to 31 December 2012, using each entity’s canonical name as the query. For each entity, at least 2,200 tweets were collected: at least 700 tweets from the beginning of the timeline are used as the training set, and at least the 1,500 most recent tweets are reserved for the test set. This distribution was chosen to obtain a temporal separation (ideally of several months) between the training and test data. The corpus also comprises additional background tweets for each entity (up to 50,000 tweets, with large variability across entities). Note that the final number of available tweets in these sets may be lower, since some posts may have been deleted by their users. In order to respect Twitter’s terms of service, we do not provide the contents of the tweets; the tweet identifiers can be used to retrieve the texts of the posts. We provide a download tool similar to the mechanism used in the TREC Microblog Track in 2011 and 2012.
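
The per-entity temporal split described above can be illustrated with a minimal Python sketch. It is not the tool distributed with the dataset: the `Tweet` record, its field names and the `temporal_split` function are hypothetical, and the default sizes of 700 and 1,500 are simply the minimum figures quoted in the description. Tweets that fall between the two sets are left unassigned here, which is an assumption of this sketch.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple, Tuple


class Tweet(NamedTuple):
    """Hypothetical record: the dataset itself ships tweet identifiers, not texts."""
    tweet_id: int
    entity: str
    created_at: datetime


def temporal_split(
    tweets: List[Tweet], n_train: int = 700, n_test: int = 1500
) -> Tuple[List[Tweet], List[Tweet]]:
    """Split one entity's tweets by time.

    The earliest n_train tweets form the training set and the latest
    n_test tweets form the test set, so the two sets never overlap in time.
    """
    if len(tweets) < n_train + n_test:
        raise ValueError("not enough tweets for a non-overlapping split")
    ordered = sorted(tweets, key=lambda t: t.created_at)
    train = ordered[:n_train]   # beginning of the timeline
    test = ordered[-n_test:]    # end of the timeline
    return train, test


# Usage with synthetic data: one tweet per hour starting on 1 June 2012.
start = datetime(2012, 6, 1)
sample = [Tweet(i, "example_entity", start + timedelta(hours=i)) for i in range(2200)]
train_set, test_set = temporal_split(sample)
```

Because deleted posts can no longer be retrieved, the sets obtained in practice may be smaller than the sizes used here.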

Data time period: 2013 to 2013

Identifiers

Local : 22197793321f8a4061a4172a61a7a313

The RepLab 2013 Dataset

This dataset is part of a larger collection
