Data

Wikipedia CJK Corpora

Queensland University of Technology
Tang, Ling-Xiang ; Geva, Shlomo
Published: 2012 (edition 1)
Type: dataset
Language: English
DOI: 10.4225/09/587dac9dd6f7b

Related Information:
http://eprints.qut.edu.au/49125/
http://eprints.qut.edu.au/49128/
http://eprints.qut.edu.au/49127/

Subjects: Link recommendation; Information Retrieval and Web Search; Information and Computing Sciences; Library and Information Studies; Evaluation tool; Cross-lingual link discovery (CLLD); Assessment tool; Wikipedia; Evaluation metrics; Anchor identification; Validation tool

Licence & Rights:

Open Licence
CC-BY-SA

Creative Commons Attribution-Share Alike 3.0
http://creativecommons.org/licenses/by-sa/3.0/au/

Access:

Other

Contact Information

Shlomo Geva
s.geva@qut.edu.au

Full description

Wikipedia pages in different languages are rarely linked to one another, apart from the cross-lingual links between pages about the same subject. Collected in June 2010, this data collection consists of 10 GB of tagged Chinese, Japanese and Korean articles, converted from Wikipedia to an XML structure by a multilingual adaptation of the YAWN system (see Related Information). The data were collected as part of the NII Test Collection for IR Systems (NTCIR) Project, which aims to enhance research in Information Access (IA) technologies, including information retrieval, with the goal of improving cross-lingual link discovery (a way of automatically finding potential links between documents written in different languages). Through cross-lingual link discovery, users can discover documents in languages they are familiar with, or in languages that have a richer set of documents than their language of choice.

Data time period: June 2010 to 30 June 2010

This dataset is part of a larger collection

Identifiers