The St Andrews dataset
consists of 28,133 photographs from the St Andrews
University Library photographic collection, which holds one of the largest
and most important collections of historic photography in Scotland. The
collection numbers in excess of 300,000 images, 10% of which have been
digitised and used for the ImageCLEF ad hoc retrieval task. The photographs are
primarily historic in nature and come from areas in and around Scotland, although
pictures of other locations also exist.
junction with large ornate fountain with columns, surrounded by rails and lamp
posts at corners; houses and shops.
17 July 1934
unclassified ][ street lamps - ornate ][ electric street lighting ][ shepherds
& shepherdesses ][ streetscapes ][ shops ]
Fig. 1. Example image caption
All images have an
accompanying textual description consisting of 8 distinct fields (see, e.g.,
Fig. 1). These fields can be used individually or collectively to facilitate
image retrieval. The 28,133 captions consist of 44,085 terms and 1,348,474 word
occurrences; the maximum caption length is 316 words, and the average is 48
words. All captions are written in British English, although the language
also contains colloquial expressions. Approximately 81% of captions contain
text in all fields; the rest generally lack the description field. In most
cases the image description is a grammatical sentence of around 15 words. The
majority of images (82%) are in black and white, although colour images are
also present in the collection.
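The caption statistics quoted above can be recomputed from the caption files. The following is a minimal sketch, not the procedure used to produce the published figures: the function name `caption_stats` and the regular-expression tokeniser are assumptions, and a different tokeniser will give slightly different counts. The last line checks that the quoted average is consistent with the quoted totals.

```python
import re
from statistics import mean


def caption_stats(captions):
    """Return vocabulary size, word occurrences, and max/mean caption length.

    Assumption: a "word" is a run of letters, digits or apostrophes, and
    terms are compared case-insensitively. The source does not say how the
    published figures were tokenised.
    """
    if not captions:
        raise ValueError("at least one caption is required")
    tokenised = [re.findall(r"[A-Za-z0-9']+", c.lower()) for c in captions]
    lengths = [len(tokens) for tokens in tokenised]
    vocabulary = {word for tokens in tokenised for word in tokens}
    return {
        "terms": len(vocabulary),
        "occurrences": sum(lengths),
        "max_length": max(lengths),
        "mean_length": mean(lengths),
    }


# Consistency check on the quoted figures: occurrences / captions
print(round(1_348_474 / 28_133, 1))  # 47.9, i.e. about 48 words per caption
```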
More information about the
St Andrews collection as used in ImageCLEF can be found here.
The types of information that people typically look for in this collection include:
- Social history, e.g. old
towns and villages, children at play and work.
- Environmental concerns,
e.g. landscapes and wild plants.
- History of photography,
e.g. particular photographers.
- Architecture, e.g.
specific or general places or buildings.
- Golf, e.g. individual
golfers or tournaments.
- Events, e.g. historic, war
- Transport, e.g. general or
specific roads, bridges etc.
- Ships and shipping, e.g.
particular vessels or fishermen.
Directory structure of the St Andrews collection
Obtain the St Andrews data, then unzip and untar the archive file. This file contains the
images and captions, which are stored under directories named in the format
stand03_[0-9]+, e.g. stand03_1171. The image filenames are in the form
stand03_[0-9]+.jpg, e.g. stand03_15312.jpg, and the captions use the same format
but with the suffix .txt, e.g. stand03_1171.txt.
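The naming scheme above makes it straightforward to walk the unpacked archive and pair each image with its caption. The sketch below is illustrative only: `pair_images_with_captions` is a hypothetical helper, and it assumes that an image and its caption sit in the same directory and share the same stand03_[0-9]+ stem. Files whose names do not match the pattern exactly (for example the contents of the "docs" directory) are skipped.

```python
import re
from pathlib import Path

# Patterns taken from the naming scheme described in the text.
DIR_RE = re.compile(r"^stand03_[0-9]+$")
FILE_RE = re.compile(r"^(stand03_[0-9]+)\.(jpg|txt)$")


def pair_images_with_captions(root):
    """Map "directory/stem" to the image and caption paths found for it.

    Assumption: an image and its caption share a stem within one directory.
    A missing image or caption is recorded as None.
    """
    pairs = {}
    for directory in sorted(Path(root).iterdir()):
        if not (directory.is_dir() and DIR_RE.match(directory.name)):
            continue
        for path in sorted(directory.iterdir()):
            match = FILE_RE.match(path.name)
            if match is None:
                continue
            key = f"{directory.name}/{match.group(1)}"
            entry = pairs.setdefault(key, {"jpg": None, "txt": None})
            entry[match.group(2)] = path
    return pairs
```

Keying on the directory as well as the stem avoids relying on stems being unique across the whole collection, which the text does not state.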
The "docs" directory contains
the following files:
- a list of all large images including their pathnames.
- stand03_thumbnails.txt - a list of all thumbnails and their pathnames.
- a list of all image captions and their pathnames.
- stand03_captions.trec - a single file with all captions in a TREC-type encoding.
- stand03_guide.pdf -
an introductory document describing the collection.
Format of the captions
The captions are stored
in the same directories as the images, as plain text files with no encoding. The
captions are also stored in a single file (stand03_captions.trec), which uses a
TREC-type encoding scheme that can be indexed by most
TREC-compliant parsers. For example, the first caption is:
<DOCNO> - stand03_2096/stand03_10695.txt
<HEADLINE>Departed glories - Falls of Cruachan Station above Loch Awe on the Oban
</RECORD_ID>
(1)Falls of Cruachan Station.
(2)Sheltie dog by single track railway below embankment, with wooden ticket office, and signals; gnarled trees lining banks.
(3)ca.1990
(4)Hamish Macmillan Brown
(5)Argyllshire, Scotland
(6)HMBR-273 pc/ADD: The photographer's pet Shetland collie dog,
views],[gamekeepers],[identified male],[dress - national],[dogs]
The majority of the "useful" text for retrieval is
contained between the <TEXT> tags. Each line represents a different
caption field (labelled 1 to 6 for reference only) in the following order:
short title, description, date of registration in the St Andrews collection,
the photographer, the location and additional notes as provided by the archive
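The field order described above suggests a simple positional parse of the text between the <TEXT> tags. What follows is a rough sketch under stated assumptions rather than a reference parser: `FIELD_NAMES` and `parse_text_block` are hypothetical names, each field is assumed to occupy exactly one line, and any nested element inside <TEXT> (such as RECORD_ID) is assumed to be a complete open/close pair that can be dropped. Captions that omit a field without leaving a blank placeholder would be misaligned by this approach.

```python
import re

# Hypothetical field names, in the order given in the text.
FIELD_NAMES = (
    "short_title",
    "description",
    "date",
    "photographer",
    "location",
    "notes",
)

TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)
NESTED_RE = re.compile(r"<(\w+)>.*?</\1>", re.DOTALL)


def parse_text_block(document):
    """Return the six caption fields of one record as a dict.

    Assumptions: one field per line, in the documented order; nested
    elements inside <TEXT> are removed before the lines are counted.
    Returns an empty dict if the record has no <TEXT> element.
    """
    match = TEXT_RE.search(document)
    if match is None:
        return {}
    body = NESTED_RE.sub("", match.group(1))
    lines = [line.strip() for line in body.splitlines() if line.strip()]
    # zip stops at the shorter sequence, so extra trailing lines are ignored.
    return dict(zip(FIELD_NAMES, lines))
```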