Inter-language linked Wikipedia documents vary in how similar they are to each other, so an approach is needed to assess their similarity accurately. We aimed to investigate how well different similarity measures correlate with each other (i.e. approaches that use linguistic resources vs. those that use none), and to what extent they correlate with human judgment.
Our methodology for building this evaluation corpus is described in Paramita et al. (2012). The corpus contains human judgments for 800 document pairs from 8 language pairs:
- 7 under-resourced language pairs: Greek (EL), Estonian (ET), Croatian (HR), Lithuanian (LT), Latvian (LV), Romanian (RO) and Slovenian (SL), each paired with English
- 1 well-resourced language pair: German-English
For each document pair, two assessors were asked to answer several questions, as shown in Figure 1.
This evaluation corpus contains two files:
|daiva||lt-en||222904_lt.txt||9465296_en.txt||1||other:"the "parallel" parts (more or less just the bibliography) are not translated into English."||1||1||1|
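A row in this layout can be split into fields with a few lines of Python. This is only a minimal sketch: it assumes that each row starts and ends with a single `|` and that fields are separated by `||`, which is inferred from the sample row above and may not match the raw files exactly. The function name `parse_row` is hypothetical, and the fields are left unnamed because the column headers are not given here.

```python
def parse_row(line: str) -> list[str]:
    """Split one pipe-delimited row into a list of field strings.

    Assumption: a single leading and trailing '|', with '||' between fields.
    """
    stripped = line.strip()
    # Drop the outer pipes if present.
    if stripped.startswith("|"):
        stripped = stripped[1:]
    if stripped.endswith("|"):
        stripped = stripped[:-1]
    # Split on the double-pipe separator; free-text comments may contain
    # quotes and spaces, so no further tokenisation is attempted.
    return [field.strip() for field in stripped.split("||")]


# The sample row shown above, copied verbatim.
row = (
    '|daiva||lt-en||222904_lt.txt||9465296_en.txt||1||'
    'other:"the "parallel" parts (more or less just the bibliography) '
    'are not translated into English."||1||1||1|'
)

fields = parse_row(row)
for index, value in enumerate(fields):
    print(index, value)
```

Splitting on the literal `||` rather than using a CSV parser keeps the nested quotation marks in the free-text field intact. A comment that itself contained `||` would need different handling.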
If you have any questions or need further information about this corpus, please contact Monica Paramita (firstname.lastname@example.org) or Paul Clough (email@example.com).