A Corpus of Plagiarised Short Answers

Created by Paul Clough (Information Studies) and Mark Stevenson (Computer Science), University of Sheffield.

Plagiarism is widely acknowledged to be a significant and increasing problem for higher education institutions. Evaluation resources are required to develop and test systems that detect plagiarism. To address this, we created a corpus of short (200-300 word) answers to Computer Science questions in which plagiarism has been simulated. The corpus has been designed to represent varying degrees of plagiarism, and we envisage that it will be a useful addition to the set of resources available for the evaluation of plagiarism detection systems. Although it is a small collection of plagiarised texts, the corpus has been created systematically and we hope it will provide a 'blueprint' for the construction of further resources.

This corpus is freely available and consent from participants has been obtained. Please let us know of any problems you encounter or comments you have on the resource.



We aimed to create a corpus for the development and evaluation of plagiarism detection systems that reflects, as far as realistically possible, the types of plagiarism practised by students in an academic setting. The authors created a set of five short answer questions (A-E) on a variety of topics that might be included in the Computer Science curriculum. For each of these questions a set of answers was obtained using a variety of approaches, some of which simulate cases in which the answer is plagiarised and others that simulate the case in which the answer is not plagiarised. To simulate plagiarism we used a suitable Wikipedia entry as a source text from which participants plagiarised. Four levels of plagiarism are represented in the corpus: near copy, light revision, heavy revision and non-plagiarism.

A total of 19 participants were recruited to create texts for the corpus, resulting in a total of 95 answers. All participants were students in the Computer Science Department of Sheffield University and were studying for a degree in Computer Science at either undergraduate or postgraduate level. Participation was restricted to students with some familiarity with Computer Science. Participants were presented with each of the five questions and asked to provide a single answer to each. Participants were instructed that answers should be between 200 and 300 words long and, to simplify later processing, should contain only standard (ASCII) characters and avoid using any symbols or computer code. For each question participants were told which approach to use to provide the answer. Two of the five questions were answered without plagiarising (the "non-plagiarism" category), one question using the near copy approach, one using light revision and one using heavy revision. For more details see our LRE journal paper.
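The instructions given to participants (200 to 300 words, ASCII characters only) can be expressed as a simple check. The following is a minimal sketch and not part of the corpus tooling: the function name `check_answer` is hypothetical, and counting words by splitting on whitespace is an assumption about how length is measured.

```python
def check_answer(text, min_words=200, max_words=300):
    """Return a list of problems found in an answer text (empty if none).

    Hypothetical helper: words are counted by whitespace splitting,
    which is an assumption rather than the authors' stated method.
    """
    problems = []
    n_words = len(text.split())
    if not min_words <= n_words <= max_words:
        problems.append(
            f"length is {n_words} words, expected {min_words}-{max_words}"
        )
    # str.isascii() is available from Python 3.7 onwards.
    if not text.isascii():
        problems.append("contains non-ASCII characters")
    return problems
```

An empty list means the text satisfies both instructions; otherwise each string in the list names one violated instruction.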

Content of the Corpus

The corpus contains 100 documents (95 answers provided by the 19 participants and the five Wikipedia source articles). Across the five tasks, there are 19 examples at each of the heavy revision, light revision and near copy levels and 38 non-plagiarised examples written independently of the Wikipedia source. The number of answers is spread unevenly across tasks and categories because a Latin-square arrangement was used to order the tasks carried out by the 19 participants. The answer texts contain 19,559 words in total (2,2230 unique tokens) and the Wikipedia pages total 14,242 words after conversion to plain text using lynx -dump and removal of URL references. The average length of a file in the corpus is 208 words (std. dev. 64.91) and 113 unique tokens (std. dev. 30.11). Overall, 59 (62%) of the files were written by native English speakers and the remaining 36 (38%) by non-native speakers.
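The category totals, and the reason they cannot be spread evenly over the five tasks, can be checked with a short calculation. The sketch below assumes a simple cyclic rotation of categories over tasks; this is only an illustration of a Latin-square style design, and the actual assignment used for the corpus may differ. The names `BASE_PATTERN` and `assign` are hypothetical.

```python
from collections import Counter

TASKS = ["A", "B", "C", "D", "E"]
# Each participant writes two non-plagiarised answers and one answer
# at each of the three plagiarism levels.
BASE_PATTERN = [
    "non-plagiarism",
    "non-plagiarism",
    "near copy",
    "light revision",
    "heavy revision",
]
N_PARTICIPANTS = 19


def assign(n_participants=N_PARTICIPANTS):
    """Return (participant, task, category) triples.

    Assumed scheme: the category pattern is rotated by one position
    for each successive participant (a cyclic Latin square).
    """
    rows = []
    for p in range(n_participants):
        shift = p % len(TASKS)
        pattern = BASE_PATTERN[shift:] + BASE_PATTERN[:shift]
        for task, category in zip(TASKS, pattern):
            rows.append((p, task, category))
    return rows


rows = assign()
by_category = Counter(category for _, _, category in rows)
by_task_and_category = Counter((task, category) for _, task, category in rows)

print("answers:", len(rows))
print("per category:", dict(by_category))
for task in TASKS:
    counts = {c: by_task_and_category[(task, c)] for c in sorted(set(BASE_PATTERN))}
    print(task, counts)
```

Under this assumed rotation the totals match the figures above (95 answers: 38 non-plagiarised and 19 at each plagiarism level), while each plagiarism level appears either 3 or 4 times per task, because 19 participants cannot be divided evenly over five rotations.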



Creative Commons License
Corpus of Plagiarised Short Answers by Paul Clough and Mark Stevenson is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 2.0 UK: England & Wales License


Please cite the corpus with the following reference:

Clough, P. and Stevenson, M. Developing A Corpus of Plagiarised Short Answers, Language Resources and Evaluation: Special Issue on Plagiarism and Authorship Analysis, In Press.


Dr. Paul Clough (p.d.clough@sheffield.ac.uk)

Dr. Mark Stevenson (m.stevenson@dcs.shef.ac.uk)

Page last updated: 09/10/2009