Results for the adhoc retrieval task
Here are the results for the ImageCLEF 2004 adhoc retrieval task. If you just want the results, click here.
This page summarises the evaluation process and the format of results produced with the trec_eval tool (we are using the version supplied to us by UMASS, together with the ireval.pl Perl script that comes with the Lemur toolkit distribution). A comparison between entries and a discussion of the evaluation procedure will be given in the ImageCLEF 2004 overview paper, which will appear in this year's CLEF proceedings.
To assess your entries, we did the following:
Several assessors judged the image pools generated by pooling the submissions. The topic creator assessed all 25 topics, and a further 10 assessors judged 5 topics each, giving 3 sets of assessments per topic. To reduce subjectivity in the relevance assessments, we created 6 sets of qrels based on the overlap of relevant images between assessors, and on whether partially relevant images were included in the qrels set. To compute the overlap between judgements, we used a voting scheme in which the topic creator's judgement counts 2 and each other assessor's counts 1, so each image in a topic pool can receive a maximum count of 4 (three assessors judging each image). The partially relevant judgement was used to pick up images that the judge considered relevant in some way but could not be entirely confident about (e.g. the required subject is in the background of the image). The 6 relevance sets are listed here (with a link to the list of relevant images for each topic in each qrels set):
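As a sketch of how this voting scheme can be expressed (the function and data layout below are illustrative assumptions, not the actual assessment software):

    # Illustrative sketch of the overlap voting scheme described above.
    # Assumes judgements are dicts mapping image ID to one of
    # "relevant", "partial" or "not relevant" (an assumed layout,
    # not the format used by the actual assessment software).

    def vote_counts(creator_judgements, other_judgements, include_partial=True):
        """Map each image ID to its vote count (maximum 4): the topic
        creator's judgement counts 2, each other assessor's counts 1."""
        accepted = {"relevant", "partial"} if include_partial else {"relevant"}
        counts = {}
        for image_id, judgement in creator_judgements.items():
            if judgement in accepted:
                counts[image_id] = counts.get(image_id, 0) + 2
        for judgements in other_judgements:
            for image_id, judgement in judgements.items():
                if judgement in accepted:
                    counts[image_id] = counts.get(image_id, 0) + 1
        return counts

A qrels set requiring the creator plus at least one other assessor (such as pisec-total) then corresponds to keeping images with a count of 3 or more.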
Each qrels set contains one file per topic listing the relevant images in ascending order of image ID (these are unique identifiers). For evaluation in ImageCLEF we have used the pisec-total qrels set: an image is counted as relevant if the topic creator and at least 1 other assessor agree on it (either as relevant or partially relevant - it does not matter for this set). I have empirical evidence to suggest that the choice of qrels set (from the 6 above) does not have a large effect on the system rankings (only on the mean average precision score itself).
Qrels for the partial_isec-total qrels set can be found in the TREC (4 column) format here, which will work with ireval.pl and trec_eval.
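For reference, a qrels file in the TREC 4-column format has one line per judged image: the topic number, an iteration field (conventionally 0), the image ID and a binary relevance judgement. The image IDs below are invented for illustration:

    1 0 stand03_1001_01 1
    1 0 stand03_1002_14 1
    2 0 stand03_0042_07 1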
Evaluation

Given your submission, we went through a process of identifying and marking documents in each ranked list as relevant or not based on the 6 sets of relevant documents. To enable comparison between participants, relevant documents not found in the top 1000 results were assigned rank positions starting from 28133. We used the UMASS and Lemur versions of the standard trec_eval tool to compute the mean average precision score for each submission. This provides the "standard" information retrieval evaluation measures, e.g. precision at a given rank cut-off, interpolated precision at 11 recall points, and single-valued summaries for each measure. We have computed the scores for each topic so you can inspect performance on individual queries, as well as across all 25 topics.
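As an illustration of how this rank assignment interacts with average precision, here is a minimal sketch of uninterpolated average precision for one topic (illustrative only; the official scores come from trec_eval itself):

    # Sketch of uninterpolated average precision for a single topic,
    # with relevant images missing from the top 1000 placed at ranks
    # starting from 28133, as described above.

    def average_precision(ranked_ids, relevant_ids, start_rank=28133):
        relevant_ids = set(relevant_ids)
        hits = 0
        precision_sum = 0.0
        for rank, image_id in enumerate(ranked_ids, start=1):
            if image_id in relevant_ids:
                hits += 1
                precision_sum += hits / rank
        # Relevant images not retrieved contribute precision values
        # at ranks start_rank, start_rank + 1, and so on.
        for offset, _ in enumerate(sorted(relevant_ids - set(ranked_ids))):
            hits += 1
            precision_sum += hits / (start_rank + offset)
        return precision_sum / len(relevant_ids) if relevant_ids else 0.0

MAP is then the mean of these per-topic scores across the 25 topics.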
If you want to evaluate your own systems, follow these instructions.
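For example, with trec_eval the evaluation is a single command taking the qrels file and your run file in TREC result format (the file names here are placeholders; substitute your own):

    trec_eval pisec-total.qrels myrun.results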
If you need further details of the evaluation process, have any questions or problems with interpreting the results then please don't hesitate to contact Paul Clough.
Results are based on the pisec-total qrels set, and systems are ranked by their uninterpolated mean average precision (MAP) score across all 25 topics. Submissions are listed by run identifier (a list of which run IDs relate to which groups can be found here). The results listed include all runs submitted to ImageCLEF, but in our final presentation we will separate results for various categories of run (e.g. feedback/expansion, modality etc.). Based on information sent to me, I have included this categorisation in the results (an Excel spreadsheet and a CSV file) for the pisec-total qrels set. The % of monolingual figure is computed against the highest monolingual MAP score (0.5865 for the pisec-total set). Initially I have broken the results down by language and listed up to the top 10 results for each.
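For example, a run with a MAP of 0.4500 would be reported as 0.4500 / 0.5865 ≈ 76.7% of monolingual.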
Official results (pisec-total) [csv] [Excel]*
*Please note we still need to finally confirm the parameters used for each run (I am waiting for final confirmation).
Output from trec_eval
You can get a summary of the trec_eval output (called <runid>.res_short) and a per-topic output (called <runid>.res_long) for each of the qrels sets (but please note we only used the pisec-total qrels set for evaluation). These can be used, for example, to analyse the results on a query-by-query basis, produce precision-recall graphs etc. If you require some tools to process the trec_eval output then contact Paul Clough, because I may have something to help you.
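As an example, once the 11-point interpolated precision values for a run have been read from its .res_long file, a precision-recall graph can be drawn along these lines (the precision values are placeholders, and matplotlib is an assumed dependency, not something supplied with the results):

    # Sketch: plot an 11-point interpolated precision-recall curve.
    # The precision values below are placeholders; substitute the
    # values read from a <runid>.res_long file.
    import matplotlib.pyplot as plt

    recall_points = [i / 10 for i in range(11)]    # 0.0, 0.1, ..., 1.0
    precision = [0.70, 0.65, 0.60, 0.55, 0.50,     # placeholder values
                 0.45, 0.40, 0.35, 0.30, 0.25, 0.20]

    plt.plot(recall_points, precision, marker="o")
    plt.xlabel("Recall")
    plt.ylabel("Interpolated precision")
    plt.title("Precision-recall graph for one run")
    plt.savefig("pr_graph.png")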
What to do next
You can continue experimenting with your systems using the qrels supplied (make sure you use the pisec-total qrels file), but please be ready to submit a paper to Carol Peters at CLEF by 15th August. If you are familiar with trec_eval, you should be able to make use of the qrels files in further system evaluation; otherwise you can analyse and use the results from the trec_eval output supplied by us and given in the table above.
Thanks and acknowledgements
We thank everyone who participated in ImageCLEF 2004 for making this such an interesting and successful evaluation. In particular we thank St Andrews University Library (esp. Norman Reid) for letting us use their collection. What makes this evaluation possible are the relevance assessments, and we want to thank Hideo Joho, Simon Tucker, Mark Sanderson, Steve Whittaker, Wim Peters, Diego Uribe and Horacio Saggion.
My thanks also go to those involved in translating the captions, including Jian-Yun Nie, Jesper Kallehauge, Assad Alberair, Heidi Christensen, Xiao Mang Shou, Michael Bonn, Maarten de Rijke, Henning Mueller, Diego Uribe, Jussi Karlgren, Carol Peters, Eija Airio, Natalia Loukachevitch and Hideo Joho.
Page Maintained by Paul Clough
© University of Sheffield 2004