Full description

The ClueWeb12 dataset was created to support research on information retrieval and related human language technologies. The dataset consists of 870,043,929 English web pages, collected between February 10, 2012 and May 10, 2012. ClueWeb12 is a companion or successor to the ClueWeb09 web dataset. Distribution of ClueWeb12 began in January 2013.

Significance statement

The ClueWeb12 dataset is the largest, most complete sample dataset of the broader internet readily available for academic research. Developed by Carnegie Mellon University, it is commonly seen as the benchmark for large-scale information retrieval experiments. ClueWeb's size requires its distribution on standalone HDDs, delivered via international freight. The main savings and efficiencies associated with mirroring this data will come from reduced bandwidth requirements, achieved by permanently co-locating an accessible copy of the full dataset close to computational assets. Subject to ClueWeb licensing requirements, hosting the dataset will allow other researchers access without repeating the order/delivery/upload process.

Our immediate use of this data is a project on social analytics. Although that term is a common catchphrase in current literature, our research differs markedly by focusing on the emerging field of large-scale digital forensics. Whereas commercially focused research may identify potential customers for a product, our research is designed to identify indicators of illegal and/or dangerous behavior, for example child exploitation or recruitment to violent extremism. To date, we have presented our proposals and early progress to international and domestic counter-terrorism/extremism researchers and organisations, with a great deal of interest emerging from foreign government and research bodies.

Created: 2012-02-10 to 2012-05-10

Data time period: 2012-02-10 to 2012-05-10

Spatial Coverage And Location

iso3166: AU

Subjects

User Contributed Tags

Linguistics

Identifiers

Handle : 1959.1/1175957

ClueWeb12

Licence & Rights:

Access:

This dataset is part of a larger collection
