The RepLab 2013 Dataset

RMIT University, Australia
Damiano Spina (Principal investigator)
Date: 2018
Dataset URL: http://nlp.uned.es/replab2013/
Type: dataset
Language: English
Subjects: Automatic classification; Taxonomy; Real time processing; Twitter; Information Retrieval and Web Search; Information and Computing Sciences; Library and Information Studies

Licence & Rights:


CC BY-NC: Attribution-Noncommercial 3.0 AU
http://creativecommons.org/licenses/by-nc/3.0/au

All rights reserved

Access:

Data available in link

Full description

The RepLab 2013 dataset consists of more than 142,000 tweets in English and Spanish. The balance between the two languages depends on the availability of data for each entity in the dataset. The corpus is a collection of tweets referring to a selected set of 61 entities from four domains: automotive, banking, universities and music/artists. The domains were selected to offer a variety of scenarios for reputation studies. Crawling was performed from 1 June 2012 to 31 December 2012, using each entity’s canonical name as the query. For each entity, at least 2,200 tweets were collected: at least 700 tweets from the beginning of the timeline are used as the training set, and at least the last 1,500 tweets are reserved for the test set. This distribution was chosen to obtain a temporal separation (ideally of several months) between the training and test data. The corpus also comprises additional background tweets for each entity (up to 50,000 tweets, with large variability across entities). Note that the number of tweets actually available in these sets may be lower, since some posts may have been deleted by their authors. To respect Twitter’s terms of service, we do not provide the contents of the tweets; the tweet identifiers can be used to retrieve the texts of the posts. We provide a download tool similar to the mechanism used in the TREC Microblog Track in 2011 and 2012.
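
The per-entity split described above (earliest tweets for training, latest tweets for testing) can be illustrated with a minimal sketch. This is not the dataset's own tooling: the `(tweet_id, created_at)` record shape, the `temporal_split` helper, and the synthetic timeline are assumptions made purely for illustration, and the default sizes of 700 and 1,500 are the minimums quoted in the description.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Hypothetical record shape: (tweet_id, created_at). The real files may differ.
Tweet = Tuple[str, datetime]


def temporal_split(
    tweets: List[Tweet], n_train: int = 700, n_test: int = 1500
) -> Tuple[List[Tweet], List[Tweet]]:
    """Return (train, test): the earliest n_train and the latest n_test tweets."""
    if len(tweets) < n_train + n_test:
        raise ValueError("not enough tweets for a disjoint train/test split")
    ordered = sorted(tweets, key=lambda t: t[1])  # oldest first
    return ordered[:n_train], ordered[-n_test:]


# Synthetic timeline for one entity: one tweet per hour from 1 June 2012.
start = datetime(2012, 6, 1)
timeline = [(str(i), start + timedelta(hours=i)) for i in range(3000)]

train, test = temporal_split(timeline)
# Time between the last training tweet and the first test tweet.
gap = test[0][1] - train[-1][1]
print(len(train), len(test), gap)
```

Any tweets that fall between the two sets are simply left out of both, which is what produces the temporal separation between training and test data.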

Data time period: 2013 to 2013

This dataset is part of a larger collection

Identifiers
  • Local : 22197793321f8a4061a4172a61a7a313