Data

ClueWeb12

Monash University
Dr Campbell Wilson (hasOwner)
Subjects: Pattern Recognition and Data Mining; INFORMATION AND COMPUTING SCIENCES; ARTIFICIAL INTELLIGENCE AND IMAGE PROCESSING; Forensic Statistics; MATHEMATICAL SCIENCES; STATISTICS; MASSIVE; NeCTAR; cloud computing

Licence & Rights:

Other

Organization Agreement to use the ClueWeb12 Web Research Collections

Some rights reserved.

Access:

Conditions apply

Access to the data collection may be provided by negotiation. To discuss terms and conditions contact: campbell.wilson@monash.edu

Full description

The ClueWeb12 dataset was created to support research on information retrieval and related human language technologies. The dataset consists of 870,043,929 English web pages, collected between February 10, 2012 and May 10, 2012. ClueWeb12 is a companion or successor to the ClueWeb09 web dataset. Distribution of ClueWeb12 began in January 2013.

Significance statement

The ClueWeb12 dataset is the largest, most complete sample dataset of the broader internet readily available for academic research. Developed by Carnegie Mellon University, it is commonly seen as the benchmark for large-scale information retrieval experiments.

ClueWeb12’s size requires that it be distributed on standalone hard disk drives, delivered via international freight. The main savings and efficiencies from mirroring this data will come from reduced bandwidth requirements, achieved by permanently co-locating an accessible copy of the full dataset close to computational assets. Subject to ClueWeb12 licensing requirements, hosting the dataset will give other researchers access without repeating the order, delivery, and upload process.

With regard to our immediate use of this data, our current project relates to social analytics. Although "social analytics" is a common catchphrase in the current literature, our research differs markedly by focusing on the emerging field of large-scale digital forensics. Whereas commercially focused research may identify potential customers for a product, our research is designed to identify indicators of illegal and/or dangerous behavior, for example child exploitation or recruitment to violent extremism. To date, we have presented our proposals and early progress to international and domestic counter-terrorism and counter-extremism researchers and organisations, with a great deal of interest emerging from foreign government and research bodies.

Created: 2012-02-10 to 2012-05-10

Data time period: 2012-02-10 to 2012-05-10

This dataset is part of a larger collection


Spatial Coverage And Location

iso3166: AU

Identifiers