Software



I have developed a number of programs and applications, mainly in Perl and Java. The programs below are available to download and use, but I accept no responsibility for any consequences of using them (use them at your own risk). This page is still under development.


Reuse Analysis Workbench


This Java application compares two files (e.g. plain texts) using both n-gram matching and Greedy String Tiling (GST). The program computes similarity and displays the matched strings (for GST only). The tool also uses an implementation of DIFF to create an edit script showing how one text can be rewritten into the other by rearranging GST tiles, where each step is an insertion, deletion, transposition or move of a tile. The application runs under Windows and Linux/Unix (although some of the icons may not appear under Windows).


Use: java -cp raw.jar raw.ReuseWorkbenchMain

Code: JAR file (contact me for the source code)
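The Greedy String Tiling step described above can be illustrated with a small sketch. This is not the workbench's Java code: it is a simplified, unoptimised Python version of the general GST idea, and the names greedy_string_tiling, gst_similarity and min_match, as well as the similarity formula (twice the tiled length divided by the combined length), are assumptions chosen for illustration.

```python
def greedy_string_tiling(a, b, min_match=2):
    """Return tiles (start_a, start_b, length) shared by token lists a and b.

    Simplified GST: repeatedly find the longest common runs of unmarked
    tokens and mark them as tiles, until no run of at least min_match
    tokens remains. (min_match is a hypothetical parameter name.)
    """
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        max_len = min_match
        matches = []
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                # Extend the match while tokens agree and are still unmarked.
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > max_len:
                    matches = [(i, j, k)]
                    max_len = k
                elif k == max_len:
                    matches.append((i, j, k))
        if not matches:
            break
        for i, j, k in matches:
            # Skip matches that overlap a tile placed earlier in this pass.
            if any(marked_a[i:i + k]) or any(marked_b[j:j + k]):
                continue
            for x in range(k):
                marked_a[i + x] = True
                marked_b[j + x] = True
            tiles.append((i, j, k))
    return tiles


def gst_similarity(a, b, min_match=2):
    """Assumed measure: 2 * tiled tokens / (len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 0.0
    covered = sum(length for _, _, length in greedy_string_tiling(a, b, min_match))
    return 2.0 * covered / total


if __name__ == "__main__":
    text_a = "the cat sat on the mat".split()
    text_b = "on the mat the cat sat".split()
    print(greedy_string_tiling(text_a, text_b))
    print(gst_similarity(text_a, text_b))
```

Each tile records where a shared run starts in both texts, which is the information an edit script needs in order to describe a rewrite as moves of tiles.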


N-gram overlap

This Perl script compares two files (a and b) and computes the overlap of fixed-length sequences of words (n-grams) between them, using a number of similarity measures. The program computes similarity for each n-gram length up to the maximum length specified.

Use: perl overlap.pl [-fsq] -a <file a> -b <file b> -n <max n-gram length>

Code: Perl script (zip archive including extras).
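As a rough illustration of word n-gram overlap, here is a minimal Python sketch. It is not a port of overlap.pl: the function names are hypothetical, and the two measures shown (containment of a in b, and the Jaccard coefficient) are assumed examples, since the measures the script implements are not listed here.

```python
def ngrams(tokens, n):
    """Return all word n-grams of length n as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def overlap_scores(tokens_a, tokens_b, max_n):
    """Compute overlap measures for every n-gram length from 1 to max_n.

    The measures are assumptions for illustration:
      containment = |A & B| / |A|
      jaccard     = |A & B| / |A | B|
    where A and B are the sets of distinct n-grams in each text.
    """
    scores = {}
    for n in range(1, max_n + 1):
        set_a = set(ngrams(tokens_a, n))
        set_b = set(ngrams(tokens_b, n))
        shared = len(set_a & set_b)
        union = len(set_a | set_b)
        scores[n] = {
            "containment": shared / len(set_a) if set_a else 0.0,
            "jaccard": shared / union if union else 0.0,
        }
    return scores


if __name__ == "__main__":
    a = "the cat sat on the mat".split()
    b = "the cat lay on the mat".split()
    for n, measures in overlap_scores(a, b, 3).items():
        print(n, measures)
```

Longer n-grams are stricter evidence of shared text, so scores typically fall as n increases.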


Rule-based sentence splitter

This Perl script implements a very simple rule-based sentence splitter that classifies each occurrence of [.!?] as either [sentence_break] or [non_sentence_break]. I based my heuristics and evaluation on a subset of the British National Corpus (BNC) version 1.0 (750,000 sentences). On this data the rules achieved correct disambiguation of 98.59% on '.', 99.24% on '?' and 98.99% on '!'. I also tested the rules on an unseen set of 52,000 sentences from the Brown corpus and achieved 97.61% on '.', 97.80% on '!' and 98.87% on '?'.

Use: perl splitter.pl -f <file to split> -o
e.g. perl splitter.pl -f input.txt -o

This outputs the split text into a file named input.txt.split

Without the -f option, input is read from STDIN

Without the -o option, the split text is printed to STDOUT

Documentation: [PS][PDF]

Code: Zip archive (contains documentation and extras)
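To show what classifying [.!?] as break or non-break can look like, here is a minimal Python sketch. The rules and the abbreviation list are illustrative assumptions, not the heuristics used in splitter.pl, and the sketch makes no claim to the accuracy figures reported above.

```python
import re

# Hypothetical abbreviation list (stored without the trailing period).
ABBREVIATIONS = {"mr", "mrs", "dr", "prof", "e.g", "i.e", "etc", "vs"}


def is_sentence_break(text, pos):
    """Decide whether the [.!?] at text[pos] ends a sentence.

    Assumed rules, for illustration only:
      1. End of text: break.
      2. Not followed by whitespace (e.g. "3.50"): no break.
      3. A period after a known abbreviation: no break.
      4. Next word starts with a lowercase letter: no break.
      5. Otherwise: break.
    """
    rest = text[pos + 1:]
    if not rest.strip():
        return True
    if not rest[0].isspace():
        return False
    before = re.search(r"(\S+)$", text[:pos])
    word = before.group(1).lower() if before else ""
    if text[pos] == "." and word in ABBREVIATIONS:
        return False
    if rest.lstrip()[0].islower():
        return False
    return True


def split_sentences(text):
    """Split text at every [.!?] classified as a sentence break."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]", text):
        if is_sentence_break(text, match.start()):
            sentence = text[start:match.end()].strip()
            if sentence:
                sentences.append(sentence)
            start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = "Dr. Smith paid 3.50 pounds, e.g. for tea. Was it good? Yes! It was."
    for sentence in split_sentences(sample):
        print(sentence)
```

A sketch this small misses many cases, such as a period followed by a closing quotation mark, which is where corpus-derived heuristics earn their keep.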


Literary stylistic programs



EuroWordNet translation module


Contact Details:

Information School
University of Sheffield
Room 226,
Regent Court,
211 Portobello Street,
Sheffield, S1 4DP UK.
Tel: +44 (0) 114 2222664
Fax: +44 (0) 114 2780300
Email: p.d.clough@sheffield.ac.uk