Software
I have developed a number of programs and applications,
mainly in Perl and Java. Here are a few that you can download and
use, although I accept no responsibility for any consequences of using
them (use them at your own risk).
This page is still under development.
Reuse Analysis Workbench
This Java application compares two files (e.g. plain texts) using both
n-gram matching and Greedy String Tiling (GST). The program computes
similarity and displays the matched strings (for GST only). The tool also
uses an implementation of DIFF to create an edit script showing how one text
can be rewritten into the other by re-arranging GST tiles, where each step
inserts, deletes, transposes or moves a tile. The application runs under
Windows or Linux/Unix (although some of the icons may not appear under
Windows).
Use: java -cp raw.jar raw.ReuseWorkbenchMain
Code: JAR file (contact me for the source code)
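
The Greedy String Tiling step described for the workbench can be sketched roughly as follows. This is a minimal Java sketch of the textbook algorithm, not the workbench's own code; the class name GreedyStringTiling, the Tile helper, the minMatchLength parameter and the similarity formula 2 * coverage / (|A| + |B|) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class GreedyStringTiling {

    /** A tile: a run of `length` equal tokens starting at posA in A and posB in B. */
    static final class Tile {
        final int posA, posB, length;
        Tile(int posA, int posB, int length) {
            this.posA = posA; this.posB = posB; this.length = length;
        }
        @Override public String toString() {
            return "Tile(a=" + posA + ", b=" + posB + ", len=" + length + ")";
        }
    }

    /** Repeatedly finds the longest unmarked common runs and marks them as tiles. */
    static List<Tile> tile(String[] a, String[] b, int minMatchLength) {
        if (minMatchLength < 1) throw new IllegalArgumentException("minMatchLength must be >= 1");
        boolean[] markedA = new boolean[a.length];
        boolean[] markedB = new boolean[b.length];
        List<Tile> tiles = new ArrayList<>();
        int maxMatch;
        do {
            maxMatch = minMatchLength;
            List<Tile> matches = new ArrayList<>();
            // Scan every pair of unmarked start positions for the longest common run.
            for (int i = 0; i < a.length; i++) {
                if (markedA[i]) continue;
                for (int j = 0; j < b.length; j++) {
                    if (markedB[j]) continue;
                    int k = 0;
                    while (i + k < a.length && j + k < b.length
                            && !markedA[i + k] && !markedB[j + k]
                            && a[i + k].equals(b[j + k])) {
                        k++;
                    }
                    if (k == maxMatch) {
                        matches.add(new Tile(i, j, k));
                    } else if (k > maxMatch) {
                        matches.clear();
                        matches.add(new Tile(i, j, k));
                        maxMatch = k;
                    }
                }
            }
            // Turn each maximal match into a tile unless an earlier tile overlaps it.
            for (Tile m : matches) {
                if (isFree(markedA, m.posA, m.length) && isFree(markedB, m.posB, m.length)) {
                    for (int k = 0; k < m.length; k++) {
                        markedA[m.posA + k] = true;
                        markedB[m.posB + k] = true;
                    }
                    tiles.add(m);
                }
            }
        } while (maxMatch > minMatchLength);
        return tiles;
    }

    private static boolean isFree(boolean[] marked, int start, int length) {
        for (int k = 0; k < length; k++) {
            if (marked[start + k]) return false;
        }
        return true;
    }

    /** Assumed similarity measure: 2 * tiled tokens / (|A| + |B|). */
    static double similarity(String[] a, String[] b, List<Tile> tiles) {
        if (a.length + b.length == 0) return 0.0;
        int coverage = 0;
        for (Tile t : tiles) coverage += t.length;
        return 2.0 * coverage / (a.length + b.length);
    }

    public static void main(String[] args) {
        String[] a = "the quick brown fox jumps over the lazy dog".split(" ");
        String[] b = "the lazy dog jumps over the quick brown fox".split(" ");
        List<Tile> tiles = tile(a, b, 2);
        System.out.println(tiles + " similarity=" + similarity(a, b, tiles));
    }
}
```

The minimum match length keeps very short coincidental matches (such as single common words) from becoming tiles; a value of 2 or 3 words is a plausible starting point, but that is a guess rather than the workbench's setting.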
N-gram overlap
This Perl script compares two files (a and b) and computes
the overlap of fixed-length sequences of words (n-grams) between them using
a number of similarity measures. The program computes the similarity for
each n-gram length up to the maximum length specified.
Use: perl overlap.pl [-fsq] -a <file a> -b <file b> -n <max n-gram length>
Code: Perl script (zip archive including extras).
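
To illustrate what a word n-gram overlap computation involves, here is a minimal Java sketch. It is not a port of overlap.pl: the class name NgramOverlap, the whitespace tokenisation, and the two measures shown (containment and the Jaccard coefficient) are assumptions, since the measures the script actually implements are not listed here.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class NgramOverlap {

    /** Lower-cases the text and splits it on whitespace (an assumed tokenisation). */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().toLowerCase().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Returns the set of distinct word n-grams of length n. */
    static Set<String> ngrams(List<String> tokens, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            grams.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return grams;
    }

    /** Containment of a in b: shared n-grams divided by the n-grams of a. */
    static double containment(Set<String> a, Set<String> b) {
        if (a.isEmpty()) return 0.0;
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        return (double) shared.size() / a.size();
    }

    /** Jaccard coefficient: shared n-grams divided by all distinct n-grams. */
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) return 0.0;
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        return (double) shared.size() / union.size();
    }

    public static void main(String[] args) {
        List<String> a = tokenize("the cat sat on the mat");
        List<String> b = tokenize("the cat sat on the sofa");
        int maxN = 3; // plays the role of the maximum n-gram length
        for (int n = 1; n <= maxN; n++) {
            Set<String> ga = ngrams(a, n);
            Set<String> gb = ngrams(b, n);
            System.out.printf("n=%d containment=%.3f jaccard=%.3f%n",
                    n, containment(ga, gb), jaccard(ga, gb));
        }
    }
}
```

Containment is asymmetric (it measures how much of file a reappears in file b), which suits reuse detection when one text is suspected of being derived from the other; the Jaccard coefficient treats both files equally.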
Rule-based sentence splitter
This Perl script implements a very simple rule-based sentence splitter that
classifies each occurrence of [.!?] as either [sentence_break] or
[non_sentence_break]. I based my heuristics and evaluation on a subset of the
British National Corpus (BNC) version 1.0 (750,000 sentences). On this subset
the rules disambiguated 98.59% of '.', 99.24% of '?' and 98.99% of '!'
correctly. I also tested the rules on an unseen set of 52,000 sentences from
the Brown corpus and achieved 97.61% on '.', 97.80% on '!' and 98.87% on '?'.
Use: perl splitter.pl -f <file to split> -o
e.g. perl splitter.pl -f input.txt -o
This writes the split text to a file named input.txt.split
Without the -f option, input is read from STDIN
Without the -o option, the split text is printed to STDOUT
Documentation: [PS] [PDF]
Code: Zip archive (contains documentation and extras)
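
The idea of classifying each [.!?] as a sentence break or not can be illustrated with a small Java sketch. The rules here are hypothetical and far cruder than those in splitter.pl (they are not taken from its documentation): a mark counts as a break only if it is followed by whitespace and then an upper-case letter, or ends the text, and a full stop does not count if the preceding word is in a short, made-up abbreviation list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SimpleSentenceSplitter {

    // Hypothetical abbreviation list, for illustration only.
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("mr", "mrs", "dr", "prof", "e.g", "i.e", "etc"));

    /** Decides whether the character at index i is a sentence break. */
    static boolean isSentenceBreak(String text, int i) {
        char c = text.charAt(i);
        if (c != '.' && c != '!' && c != '?') return false;
        int next = i + 1;
        if (next >= text.length()) return true;              // end of text
        if (!Character.isWhitespace(text.charAt(next))) {
            return false;                                    // e.g. "3.15" or "p.m."
        }
        while (next < text.length() && Character.isWhitespace(text.charAt(next))) {
            next++;
        }
        if (next >= text.length()) return true;              // only trailing whitespace
        if (c == '.') {
            // Look back to the start of the word that precedes the full stop.
            int start = i;
            while (start > 0 && !Character.isWhitespace(text.charAt(start - 1))) {
                start--;
            }
            String word = text.substring(start, i).toLowerCase();
            if (ABBREVIATIONS.contains(word)) return false;
        }
        return Character.isUpperCase(text.charAt(next));
    }

    /** Splits the text at every position classified as a sentence break. */
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (isSentenceBreak(text, i)) {
                String s = text.substring(start, i + 1).trim();
                if (!s.isEmpty()) sentences.add(s);
                start = i + 1;
            }
        }
        String rest = text.substring(start).trim();
        if (!rest.isEmpty()) sentences.add(rest);
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Dr. Smith arrived at 3.15 p.m. yesterday. Was he late? No! He was early.";
        for (String s : split(text)) {
            System.out.println(s);
        }
    }
}
```

Rules this simple fail on cases such as a full stop followed by a closing quotation mark, or an abbreviation that genuinely ends a sentence, which is why a splitter of this kind is normally tuned and evaluated against a corpus.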
Literary stylistic programs
EuroWordNet translation module
Contact Details:
Information School, University of Sheffield, Room 226, Regent Court,
211 Portobello Street, Sheffield, S1 4DP, UK.
Tel: +44 (0) 114 2222664
Fax: +44 (0) 114 2780300
Email: p.d.clough@sheffield.ac.uk