Sample Data

The sample data was initially created by participants and some third-party annotators as discussion material to help define the RITE task.

Development / Training Data

The development data can be downloaded in this section.

Language | Description | Download
Chinese (Simplified) | The development set contains 407 pairs: initial samples (ID=1~5), pairs created by participants (ID=6~8), pairs created by annotators (ID=9~50), and pairs transliterated from CS (ID=51~407, with 3 removed for label disagreements). Pairs were selected so that all annotators agree on the same label. | MC
Chinese (Traditional) | The 421 pairs in the development set were created for the MC subtask. Due to limited resources and time constraints, we decided to use this data as the development data for all subtasks. The pairs have been carefully reviewed and agreed on by three annotators. | MC
Japanese | Ten Japanese annotators studied general trends in the sample data developed by participants, then created data by searching for and extracting sentences from newswire text, with search queries selected by a random topic generator. They built 1000 BC and 880 MC pairs, and 4 annotators assigned labels; inter-annotator agreement (Fleiss' kappa) among the 4 is 0.829 for BC and 0.759 for MC. The data has been split in two for the development and test data, each half containing 500 BC and 440 MC pairs. Where needed, editing of the original sentences was allowed. Contact us for another version of the data that contains metadata such as original article ID, post-edit indicator, annotator's comments, etc. | BC, MC
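
The agreement figures for the Japanese data are Fleiss' kappa values. For readers who want to compute the same statistic on their own annotations, the following is a minimal sketch of the standard Fleiss' kappa formula. It is not the organizers' script; the function name `fleiss_kappa`, the input layout (one list of labels per pair, one label per annotator), and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels.

    Hypothetical helper: every item must carry the same number of
    labels (one per annotator), e.g. [["Y", "Y", "N", "Y"], ...].
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2:
        raise ValueError("need at least two ratings per item")
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item must have the same number of ratings")

    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Observed agreement per item: share of annotator pairs that agree.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement from the overall share of each category.
    total = n_items * n_raters
    p_cats = [sum(c[cat] for c in counts) / total for cat in categories]
    p_expected = sum(p ** 2 for p in p_cats)

    if p_expected == 1.0:
        # Only one label was ever used; kappa is undefined.
        return float("nan")
    return (p_observed - p_expected) / (1 - p_expected)


# Toy input (not task data): 4 pairs, each labelled by 4 annotators.
toy_ratings = [
    ["Y", "Y", "Y", "Y"],
    ["N", "N", "N", "N"],
    ["Y", "Y", "Y", "N"],
    ["N", "N", "Y", "N"],
]
print(fleiss_kappa(toy_ratings))  # 0.5 for this toy input
```

The sketch takes raw labels rather than a precomputed count matrix so that it can be applied directly to a per-annotator label listing.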

If you can contribute resources (e.g. extra human annotations, automatically generated word segmentations, parse results, etc.), please email the RITE organizers at <ntc9-rite-organizers [at]> so that we can share them among task participants.

Test / Formal Run Data

