Full description

Wikipedia pages in different languages are rarely linked to one another, except through the cross-lingual links between pages about the same subject. Collected in June 2010, this data collection consists of 10 GB of tagged Chinese, Japanese and Korean articles, converted from Wikipedia to an XML structure by a multi-lingual adaptation of the YAWN system (see Related Information). The data were collected as part of the NII Test Collection for IR Systems (NTCIR) Project, which aims to enhance research in Information Access (IA) technologies, including information retrieval. The collection is intended to support cross-lingual link discovery, a way of automatically finding potential links between documents written in different languages. Through cross-lingual link discovery, users can discover documents in languages that they are familiar with, or that offer a richer set of documents than their language of choice.

Data time period: June 2010 to 30 June 2010

Identifiers

DOI : 10.4225/09/587DAC9DD6F7B
Local : 10378.3/8085/1018.13417

Wikipedia CJK Corpora

This dataset is part of a larger collection