I have developed a number of programs and applications, mainly in Perl and Java. A few of them are available here to download and use at your own risk: I accept no responsibility for any consequences of using them. This page is still under development.
Reuse Analysis Workbench
This Java application compares two files (e.g. plain texts) using both n-gram matching and Greedy String Tiling (GST). The program computes similarity and displays the string matches (for GST only). The tool also uses an implementation of DIFF to create an edit script describing how one text can be rewritten into the other by rearranging GST tiles, where each step is an insert, delete, transposition or move of a tile. The application runs under Windows or Linux/Unix (although some of the icons may not appear under Windows).
Use: java -cp raw.jar raw.ReuseWorkbenchMain
Code: JAR file (contact me for the source code)
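To give an idea of what Greedy String Tiling does, here is a minimal Java sketch of the basic algorithm (without the Karp-Rabin hashing speed-up). It is not the code shipped in raw.jar: the class name GreedyStringTilingSketch, the Tile type, the whitespace tokenisation and the similarity formula 2 * covered / (|a| + |b|) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names and the similarity formula are assumptions.
class GreedyStringTilingSketch {

    /** A tile: a run of `length` matching tokens starting at startA in a and startB in b. */
    static final class Tile {
        final int startA, startB, length;
        Tile(int startA, int startB, int length) {
            this.startA = startA;
            this.startB = startB;
            this.length = length;
        }
    }

    /** Repeatedly finds the longest unmarked common runs and marks them as tiles. */
    static List<Tile> tile(String[] a, String[] b, int minMatchLength) {
        if (minMatchLength < 1) {
            throw new IllegalArgumentException("minMatchLength must be >= 1");
        }
        boolean[] markedA = new boolean[a.length];
        boolean[] markedB = new boolean[b.length];
        List<Tile> tiles = new ArrayList<>();
        int maxMatch;
        do {
            maxMatch = minMatchLength;
            List<Tile> candidates = new ArrayList<>();
            for (int i = 0; i < a.length; i++) {
                for (int j = 0; j < b.length; j++) {
                    // Extend the match while tokens agree and are not yet tiled.
                    int k = 0;
                    while (i + k < a.length && j + k < b.length
                            && !markedA[i + k] && !markedB[j + k]
                            && a[i + k].equals(b[j + k])) {
                        k++;
                    }
                    if (k > maxMatch) {          // longer match found: restart the list
                        candidates.clear();
                        maxMatch = k;
                    }
                    if (k == maxMatch) {
                        candidates.add(new Tile(i, j, k));
                    }
                }
            }
            // Turn every candidate that does not overlap an existing tile into a tile.
            for (Tile t : candidates) {
                if (!isOccluded(t, markedA, markedB)) {
                    for (int k = 0; k < t.length; k++) {
                        markedA[t.startA + k] = true;
                        markedB[t.startB + k] = true;
                    }
                    tiles.add(t);
                }
            }
        } while (maxMatch > minMatchLength);
        return tiles;
    }

    static boolean isOccluded(Tile t, boolean[] markedA, boolean[] markedB) {
        for (int k = 0; k < t.length; k++) {
            if (markedA[t.startA + k] || markedB[t.startB + k]) {
                return true;
            }
        }
        return false;
    }

    /** Assumed measure: proportion of tokens in both texts covered by tiles. */
    static double similarity(String[] a, String[] b, List<Tile> tiles) {
        int covered = 0;
        for (Tile t : tiles) {
            covered += t.length;
        }
        int total = a.length + b.length;
        return total == 0 ? 0.0 : 2.0 * covered / total;
    }

    public static void main(String[] args) {
        String[] a = "the cat sat on the mat".split(" ");
        String[] b = "on the mat the cat sat".split(" ");
        List<Tile> tiles = tile(a, b, 2);
        for (Tile t : tiles) {
            System.out.println("a[" + t.startA + "] ~ b[" + t.startB + "], length " + t.length);
        }
        System.out.println("similarity = " + similarity(a, b, tiles));
    }
}
```

In the example the two texts contain the same two three-word runs in swapped order, so GST finds two tiles. A swap like this is the kind of rearrangement that an edit script over tiles would record as a move or transposition.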
This Perl script compares two files (a and b) and computes the overlap of fixed-length sequences of words (n-grams) between them using a number of similarity measures. The program computes similarity for n-gram lengths up to the maximum length specified.
Use: perl overlap.pl [-fsq] -a <file a> -b <file b> -n <max n-gram length>
Code: Perl script (zip archive including extras).
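The idea behind the n-gram overlap computation can be sketched in a few lines of Java. This is not a port of overlap.pl: the class name NgramOverlapSketch, the whitespace tokenisation, and the choice of containment and Jaccard as the two measures are assumptions made for illustration, and the measures the script really uses may differ.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only; the measures shown are assumed, not taken from overlap.pl.
class NgramOverlapSketch {

    /** Returns the set of distinct word n-grams of length n. */
    static Set<String> ngrams(String[] tokens, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= tokens.length; i++) {
            grams.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return grams;
    }

    /** Proportion of the n-grams of a that also occur in b (asymmetric). */
    static double containment(Set<String> a, Set<String> b) {
        if (a.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        return (double) shared.size() / a.size();
    }

    /** Shared n-grams divided by all distinct n-grams in either text (symmetric). */
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        return (double) shared.size() / union.size();
    }

    public static void main(String[] args) {
        String[] a = "the cat sat on the mat".split("\\s+");
        String[] b = "the cat lay on the mat".split("\\s+");
        int maxN = 3; // plays the role of the -n option
        for (int n = 1; n <= maxN; n++) {
            Set<String> ga = ngrams(a, n);
            Set<String> gb = ngrams(b, n);
            System.out.println(n + "-grams: containment = " + containment(ga, gb)
                    + ", jaccard = " + jaccard(ga, gb));
        }
    }
}
```

The scores fall as n grows, because a single changed word breaks every longer n-gram that spans it.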
Rule-based sentence splitter
This Perl script implements a very simple rule-based sentence splitter that classifies each occurrence of [.!?] as either [sentence_break] or [non_sentence_break]. I based my heuristics and evaluation on a subset of the British National Corpus (BNC) version 1.0 (750,000 sentences), on which the rules achieved correct disambiguation of 98.59% on '.', 99.24% on '?' and 98.99% on '!'. I also tested the rules on an unseen set of 52,000 sentences of the Brown corpus and achieved 97.61% on '.', 97.80% on '!' and 98.87% on '?'.
Use: perl splitter.pl -f <file to split> -o
e.g. perl splitter.pl -f input.txt -o
This writes the split text to a file named input.txt.split
Without the -f option, input is read from STDIN
Without the -o option, the split text is printed to STDOUT
Code: Zip archive (contains documentation and extras)
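The general approach of classifying each [.!?] as a break or non-break can be sketched as follows. The two rules used here (a small abbreviation list, and a check that the next token starts with an upper-case letter, digit or quote) are hypothetical stand-ins chosen for illustration. They are not the heuristics in splitter.pl and will not reproduce the accuracy figures quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; the rules below are assumed, not those of splitter.pl.
class SentenceSplitterSketch {

    // Hypothetical abbreviation list (stored without the final full stop).
    private static final Set<String> ABBREVIATIONS = new HashSet<>(
            Arrays.asList("mr", "mrs", "dr", "prof", "e.g", "i.e", "etc"));

    /** Decides whether the punctuation ending `token` is a sentence break. */
    static boolean isSentenceBreak(String token, String next) {
        char last = token.charAt(token.length() - 1);
        if (last != '.' && last != '!' && last != '?') {
            return false;                       // token does not end in [.!?]
        }
        if (next == null) {
            return true;                        // end of text always closes a sentence
        }
        if (last == '.') {
            String stem = token.substring(0, token.length() - 1).toLowerCase();
            if (ABBREVIATIONS.contains(stem)) {
                return false;                   // rule 1: known abbreviation
            }
        }
        // Rule 2: a new sentence should start with an upper-case letter, digit or quote.
        char first = next.charAt(0);
        return Character.isUpperCase(first) || Character.isDigit(first)
                || first == '"' || first == '\'';
    }

    /** Splits whitespace-tokenised text into sentences using the rules above. */
    static List<String> split(String text) {
        String[] tokens = text.trim().split("\\s+");
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].isEmpty()) {
                continue;                       // happens only for empty input
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(tokens[i]);
            String next = (i + 1 < tokens.length) ? tokens[i + 1] : null;
            if (isSentenceBreak(tokens[i], next)) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            sentences.add(current.toString()); // trailing text without final punctuation
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Dr. Smith arrived at 3.15 p.m. yesterday. Was he late? No! He was early.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

With these rules "Dr." is kept inside its sentence by the abbreviation list, and "p.m." by the lower-case word that follows it. A token such as "3.15" is never considered because it does not end in punctuation.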
Literary stylistic programs
EuroWordNet translation module
University of Sheffield
211 Portobello Street,
Sheffield, S1 4DP UK.
Tel : +44 (0)
Fax : +44 (0) 114 2780300