The ACCURAT Project: Wikipedia Evaluation Corpus

Inter-language linked Wikipedia documents vary in how similar they are to one another. An approach is therefore needed to assess the similarity of these documents accurately. We aimed to investigate the correlation between different similarity measures (i.e. approaches that use linguistic resources vs. approaches that use none), and the extent to which these measures correlate with human judgment.

We describe the methodology used to build this evaluation corpus in Paramita et al. (2012). The corpus contains human judgments for 800 document pairs from 8 language pairs:
- 7 under-resourced language pairs: Greek (EL), Estonian (ET), Croatian (HR), Lithuanian (LT), Latvian (LV), Romanian (RO) and Slovenian (SL), each paired with English
- 1 well-resourced language pair: German-English

Given a document pair, two assessors were asked to answer several questions as shown in Figure 1.

Figure 1. Wikipedia Evaluation Scheme

This evaluation corpus contains two files:

  1. EvaluationDocuments.rar:
    This compressed file contains the list of evaluated documents for each language pair and the *.txt file for each document.
  2. Judgment.txt:
    This tab-separated file contains scores given by assessors for each question in the evaluation scheme. An example of the judgment file is shown in Figure 2.
AssessorId  LangPair  SourceId       TargetId         Q1   Q1-Reasons                                             Q2   Q3   Q4
daiva       lt-en     192089_lt.txt  15217870_en.txt  3    similarStructure;overlappingNEs;overlappingFragments;  3    3    3
daiva       lt-en     195239_lt.txt  4182664_en.txt   1    differentInfo;                                         2    3    2
daiva       lt-en     222904_lt.txt  9465296_en.txt   1    other:"the "parallel" parts (more or less just the bibliography) are not translated into English."  1    1    1
daiva       lt-en     ...            ...              ...  ...                                                    ...  ...  ...
Figure 2. Example of Judgment File
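
As an illustration only, the judgment file could be read along the following lines. This is a minimal sketch, not part of the corpus distribution: the function names read_judgments and group_by_pair are hypothetical, the column order is taken from the header shown in Figure 2, and the sketch assumes that the first line of the file is that header and that Q1 to Q4 hold integer scores. Quote handling is switched off because the free-text "other:" reasons can contain double quotes.

```python
import csv
import io
from collections import defaultdict

# Column order taken from the header row shown in Figure 2.
FIELDS = ["AssessorId", "LangPair", "SourceId", "TargetId",
          "Q1", "Q1-Reasons", "Q2", "Q3", "Q4"]

# The rows of Figure 2, re-typed with tab separators for illustration.
SAMPLE = (
    "AssessorId\tLangPair\tSourceId\tTargetId\tQ1\tQ1-Reasons\tQ2\tQ3\tQ4\n"
    "daiva\tlt-en\t192089_lt.txt\t15217870_en.txt\t3\t"
    "similarStructure;overlappingNEs;overlappingFragments;\t3\t3\t3\n"
    "daiva\tlt-en\t195239_lt.txt\t4182664_en.txt\t1\tdifferentInfo;\t2\t3\t2\n"
    'daiva\tlt-en\t222904_lt.txt\t9465296_en.txt\t1\tother:"the "parallel" '
    "parts (more or less just the bibliography) are not translated into "
    'English."\t1\t1\t1\n'
)


def read_judgments(handle):
    """Yield one dict per judgment row from a tab-separated file object."""
    # QUOTE_NONE keeps the double quotes inside Q1-Reasons as plain text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # assumption: the first line is the header row
    for row in reader:
        if len(row) < len(FIELDS):
            continue  # skip blank or incomplete lines
        record = dict(zip(FIELDS, (cell.strip() for cell in row)))
        for question in ("Q1", "Q2", "Q3", "Q4"):
            record[question] = int(record[question])  # assumption: integer scores
        yield record


def group_by_pair(records):
    """Collect the judgments that refer to the same document pair."""
    pairs = defaultdict(list)
    for record in records:
        key = (record["LangPair"], record["SourceId"], record["TargetId"])
        pairs[key].append(record)
    return dict(pairs)


if __name__ == "__main__":
    grouped = group_by_pair(read_judgments(io.StringIO(SAMPLE)))
    for key, judgments in grouped.items():
        print(key, [(j["AssessorId"], j["Q1"]) for j in judgments])
```

To read the real file, the StringIO object would be replaced by something like open("Judgment.txt", newline="", encoding="utf-8"); the encoding is an assumption and may need adjusting. Grouping by document pair places the scores of the two assessors side by side, which is a convenient starting point for comparing their judgments.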

Contacts

If you have any questions or need further information about this corpus, please contact Monica Paramita (m.paramita@sheffield.ac.uk) or Paul Clough (p.d.clough@sheffield.ac.uk).

Reference

Please cite the following work if you are using this corpus:

Paramita, M., Clough, P., Aker, A. and Gaizauskas, R. (2012) Correlation between Similarity Measures for Inter-Language Linked Wikipedia Articles. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey.

Acknowledgements

This work was funded by the ACCURAT Project, European Community Seventh Framework Programme (FP7/2007-2013), under Grant Agreement Number 248347. We also thank all 16 ACCURAT assessors who judged the document pairs and provided the human judgments.