Results for the adhoc retrieval task

Here are the results for the ImageCLEF 2004 adhoc retrieval task. If you just want the results, click here.

Introduction

This page summarises the evaluation process and the format of the results produced with the trec_eval tool (we use the version supplied to us by UMASS, together with the ireval.pl Perl script that comes with the Lemur toolkit distribution). A comparison between entries and a discussion of the evaluation procedure will be given in the ImageCLEF 2004 overview paper that will appear in this year's CLEF proceedings.

To assess your entries, we did the following:
  1. Extracted the top 50 results from each submission (we used a total of 190 submissions).
  2. Computed the union of these results to create a document pool for each topic.
  3. Manually assessed images in the document pool using three assessors (images judged as relevant and partially relevant).
  4. Created 6 sets of relevant images for each topic (6 qrels sets).
  5. Compared each system run against the qrels.
  6. Computed uninterpolated mean average precision across all topics using trec_eval (links to the output from trec_eval for individual topics can be found on this web page).
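Steps 1 and 2 above amount to taking, for each topic, the union of the top-ranked images from every submitted run. The following is only a minimal sketch of that idea, not the script we actually ran: the function name `build_pools` and the nested-dictionary layout of `runs` are assumptions made for illustration.

```python
from collections import defaultdict


def build_pools(runs, depth=50):
    """Build one document pool per topic.

    runs:  hypothetical layout, run_id -> {topic_id -> ranked list of image ids}
    depth: how many top-ranked images to take from each run (50 on this page)
    """
    pools = defaultdict(set)
    for per_topic in runs.values():
        for topic_id, ranking in per_topic.items():
            # The union over runs removes images returned by several systems.
            pools[topic_id].update(ranking[:depth])
    return dict(pools)
```

Because the pool is a set, an image returned by many systems is still assessed only once per topic.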

Assessments and qrels

Several assessors judged the image pools generated from the submissions. The topic creator assessed all 25 topics, and a further 10 assessors judged 5 topics each, which gives us 3 sets of assessments for each topic. To reduce subjectivity in the relevance assessments, we created 6 sets of qrels that differ in the overlap of relevant images required between assessors and in whether partially relevant images are included. To compute the overlap between judgments, we used a voting scheme in which the topic creator was given a count of 2 and each other assessor a count of 1. With three assessors judging each image in the topic pool, the maximum count for an image is 4. The partially relevant judgment was used to pick up images that the judge thought were in some way relevant but could not be entirely confident about (e.g. the required subject is in the background of the image). The 6 relevance sets are listed here (with a link to a list of files, one per topic, for each qrels set):
  1. isec-rel: images judged as relevant by all three assessors [zip].
  2. isec-total: images judged as either relevant or partially relevant by all three assessors [zip].
  3. pisec-rel: images judged as relevant by the topic creator and at least 1 other assessor [zip].
  4. pisec-total: images judged as either relevant or partially relevant by the topic creator and at least 1 other assessor [zip].
  5. union-rel: images judged as relevant by at least 1 assessor [zip].
  6. union-total: images judged as either relevant or partially relevant by at least 1 assessor [zip].
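One way to read the voting scheme is that the six sets fall out of simple thresholds on the vote count: 4 means all three assessors agree, 3 or more means the topic creator plus at least one other assessor, and 1 or more means anybody. The sketch below illustrates this reading only. The label strings (`"rel"`, `"partial"`, `"non"`), the function names and the dictionary layout are assumptions, and the thresholds assume exactly two assessors besides the topic creator.

```python
CREATOR_WEIGHT = 2  # the topic creator's vote counts twice
OTHER_WEIGHT = 1    # every other assessor's vote counts once


def vote_counts(creator, others, include_partial):
    """Sum the weighted votes for each image.

    creator: dict image_id -> "rel" | "partial" | "non" (hypothetical labels)
    others:  list of such dicts, one per additional assessor
    """
    accepted = {"rel", "partial"} if include_partial else {"rel"}
    weighted = [(CREATOR_WEIGHT, creator)] + [(OTHER_WEIGHT, j) for j in others]
    counts = {}
    for weight, judgments in weighted:
        for image_id, label in judgments.items():
            if label in accepted:
                counts[image_id] = counts.get(image_id, 0) + weight
    return counts


def qrels_sets(creator, others):
    """Return the six relevance sets for one topic, keyed by set name."""
    sets = {}
    for suffix, include_partial in (("rel", False), ("total", True)):
        counts = vote_counts(creator, others, include_partial)
        sets["isec-" + suffix] = {i for i, c in counts.items() if c == 4}
        sets["pisec-" + suffix] = {i for i, c in counts.items() if c >= 3}
        sets["union-" + suffix] = {i for i, c in counts.items() if c >= 1}
    return sets
```

Giving the topic creator a count of 2 is what makes the threshold of 3 work: two other assessors on their own can reach at most 2, so a count of 3 always includes the topic creator.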

Each qrels archive contains one file per topic, which lists the relevant images in ascending order of image ID (these are unique identifiers). For evaluation in ImageCLEF we have used the pisec-total qrels set. This requires that the topic creator and at least 1 other assessor agree that an image is relevant (whether as relevant or partially relevant does not matter for this set). I have empirical evidence to suggest that the choice of qrels set (from the 6 above) does not have a large effect on the system rankings, only on the mean average precision scores themselves.

Qrels for the pisec-total qrels set can be found in the TREC (4 column) format here, which will work with ireval.pl and trec_eval.
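For reference, each line of a TREC-format qrels file has four whitespace-separated columns: the topic number, an iteration field that is conventionally 0 and ignored, the document identifier, and the relevance value. If you keep your own lists of relevant images, something like the following sketch would produce lines in that format. The function name and the image identifiers in the test are made up for illustration.

```python
def qrels_lines(relevant_by_topic):
    """Turn {topic_id -> set of relevant image ids} into 4-column qrels lines.

    Columns: topic, iteration (always 0 here), image id, relevance (1).
    """
    lines = []
    for topic_id in sorted(relevant_by_topic):
        # Ascending image id within each topic, as in the per-topic files.
        for image_id in sorted(relevant_by_topic[topic_id]):
            lines.append(f"{topic_id} 0 {image_id} 1")
    return lines
```

Writing `"\n".join(qrels_lines(...))` to a file gives something that can be passed to trec_eval as the qrels argument.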

Evaluation

Given your submission, we identified and marked documents in the ranked list as relevant or not, based on the 6 sets of relevant documents. To enable comparison between participants, relevant documents not found in the top 1000 results are assigned rank positions starting from 28133. We used the UMASS and Lemur versions of the standard trec_eval tool to compute the mean average precision scores for your submission. These provide the "standard" information retrieval evaluation measures, e.g. precision at a given rank cut-off, average precision across 11 recall points, and single-valued summaries for each measure. We have computed the scores for each topic, so you can inspect performance for individual queries, as well as across all 25 topics.
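If you want a feel for what the headline score measures, uninterpolated average precision for one topic is the mean, over all relevant images, of the precision at the rank where each relevant image is retrieved. MAP is the mean of that value over topics. The sketch below is a plain textbook version and is not a substitute for trec_eval: it assumes a ranked list contains no duplicate ids, it scores relevant images that were not retrieved as zero, and it does not reproduce the rank 28133 padding described above. The function names are illustrative.

```python
def average_precision(ranked_ids, relevant_ids):
    """Uninterpolated average precision for a single topic."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant image
    # Dividing by all relevant images penalises those never retrieved.
    return precision_sum / len(relevant)


def mean_average_precision(run, qrels):
    """Mean of average precision over every topic in qrels.

    run:   {topic_id -> ranked list of image ids}
    qrels: {topic_id -> set of relevant image ids}
    """
    scores = [average_precision(run.get(t, []), rel) for t, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0
```

A topic for which a run returns nothing still counts towards the mean, with a score of zero.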

If you want to evaluate your own systems, follow these instructions.

If you need further details of the evaluation process, or have any questions or problems interpreting the results, please don't hesitate to contact Paul Clough.


Results

Results are based on the pisec-total qrels set, and systems are ranked by their uninterpolated mean average precision (MAP) score across all 25 topics. Submissions are listed by run identifier (a list of which run ids relate to which groups can be found here). The results listed include all runs submitted to ImageCLEF, but in our final presentation we will separate results for various categories of run (e.g. feedback/expansion, modality etc.). Based on the information sent to me, I have included these categories in the results files (an Excel spreadsheet and a CSV file) for the pisec-total qrels set. I have computed % of monolingual relative to the highest monolingual MAP score (0.5865 for the pisec-total set). Initially I have broken down the results by language and listed up to the top 10 results.

Group  Submission ID  MAP  % of monolingual
Monolingual        
  daedalus mirobaseen 0.5865 na
  daedalus enenrunexp1 0.5838 na
  sheffield en_en_fb 0.5829 na
  daedalus mirosbaseen 0.5623 na
  montreal UMenTNFBTI 0.562 na
  daedalus miroppbaseen 0.5609 na
  ntu NTU-adhoc-EE-T-W 0.5463 na
  daedalus mirosppbaseen 0.5388 na
  daedalus enenrunexp7 0.5339 na
  cea lic2mSAen1t 0.4289 na
  cea lic2mSAen2ti 0.428 na
Chinese        
  ntu NTU-adhoc-CE-T-WE 0.4171 71.12
  ntu NTU-adhoc-CE-T-WEI 0.4124 70.32
  ntu NTU-adhoc-CE-T-W 0.3977 67.81
  ntu NTU-adhoc-CE-T-WI 0.3969 67.67
  msu msustat2 0.2935 50.04
  msu msustat1 0.2935 50.04
  KIDS kids_dict 0.2796 47.67
  KIDS kids_onto 0.2769 47.21
  msu msusystran1 0.2458 41.91
  msu msusystran2 0.2115 36.06
Dutch        
  dcu nllsstimg 0.4321 73.67
  dcu nlstimgfbk3 0.4319 73.64
  dcu nlstimgal 0.4273 72.86
  dcu nlmgimgal 0.4219 71.94
  dcu nlmgimgfbk3 0.4207 71.73
  dcu nllsmgimg 0.4188 71.41
  montreal UMnlTFBTI 0.4004 68.27
  dcu nlsdlimgfbk3 0.3983 67.91
  dcu nllssdlimg 0.3944 67.25
  dcu nlbasest 0.3838 65.44
  daedalus mirobasedu 0.3807 64.91
Finnish        
  montreal UMfiTFBTI 0.2347 40.02
  daedalus mirobasefi 0.17 28.99
French        
  montreal UMfrTFBTI 0.5125 87.40
  dcu frintimgfbk1 0.4662 79.50
  dcu frlsintimg 0.4656 79.40
  sheffield fr_fr_fb 0.4365 74.44
  dcu frstimgfbk1 0.431 73.50
  dcu frlsstimg 0.4291 73.18
  dcu frbasest 0.4274 72.89
  dcu frstimgal 0.4254 72.54
  dcu frsdlimgfbk1 0.4088 69.71
  dcu frlssdlimg 0.4066 69.34
  dcu frlsmgimg 0.3997 68.16
German        
  dcu delsmgimg 0.5327 90.84
  dcu demgimgfbk3 0.5318 90.69
  dcu demgimgal 0.5312 90.59
  dcu delssdlimg 0.5017 85.56
  dcu desdlimgfbk3 0.5005 85.35
  dcu delsstimg 0.4737 80.78
  dcu destimgfbk3 0.4735 80.75
  dcu deintimgfbk3 0.468 79.81
  dcu destimgal 0.4679 79.79
  dcu delsintimg 0.4669 79.62
  dcu debasest 0.4639 79.11
Italian        
  dcu itstimgfbk3 0.4381 74.71
  dcu itlsstimg 0.4379 74.68
  sheffield it_it_fb 0.4355 74.27
  dcu itstimgal 0.4341 74.03
  dcu itbasest 0.402 68.55
  dcu itlssdlimg 0.3708 63.23
  dcu itsdlimgfbk3 0.3659 62.40
  montreal UMitTFBTI 0.3597 61.34
  dcu itmgimgal 0.3538 60.33
  dcu itlsmgimg 0.3515 59.94
  dcu itmgimgfbk3 0.3512 59.89
Japanese        
  daedalus mirobaseja 0.2358 40.21
  alicante ALCim04jp0 0.2256 38.47
  alicante ALCim04jp1 0.1555 26.52
  alicante ALCim04jp2 0.1427 24.33
Russian        
  daedalus mirobaseru 0.3866 65.93
  alicante ALCim04ru0 0.1472 25.10
  alicante ALCim04ru2 0.1441 24.57
  alicante ALCim04ru1 0.136 23.19
Spanish        
  sheffield es_es_fb 0.5211 88.86
  uned UNEDESENT 0.5171 88.18
  montreal UMesTFBTI 0.489 83.39
  uned UNEDES 0.4827 82.32
  dcu reessdlimg 0.4732 80.70
  uned UNEDESENTNOO 0.4671 79.66
  montreal UMesRevTFBTI 0.4505 76.82
  dcu reesmgimg 0.4464 76.13
  dcu essdlimgfbk2 0.4436 75.65
  dcu eslssdlimg 0.4404 75.10
  uned UNEDORENTNOO 0.4218 71.93
Swedish        
  montreal UMsvTFBTI 0.34 57.98
  daedalus mirobasesw 0.3043 51.89

Official results (pisec-total) [csv] [Excel]*
*Please note that we still need to confirm the parameters used for each run (I am waiting for final confirmation).
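The % of monolingual column in the table is simply each run's MAP expressed as a percentage of the best monolingual MAP (0.5865). A minimal sketch of the arithmetic, with illustrative names:

```python
BEST_MONOLINGUAL_MAP = 0.5865  # highest monolingual MAP on the pisec-total set


def percent_of_monolingual(map_score, baseline=BEST_MONOLINGUAL_MAP):
    """Express a MAP score as a percentage of the monolingual baseline."""
    return 100.0 * map_score / baseline
```

For example, NTU-adhoc-CE-T-WE with a MAP of 0.4171 gives 71.12, as listed. A few rows differ in the second decimal place when recomputed this way (0.5125 gives 87.38 against the listed 87.40), possibly because the listed figures were computed from unrounded scores.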

Output from trec_eval

You can get a summary of the trec_eval output (called <runid>.res_short) and the output for each topic (called <runid>.res_long) for each of the qrels sets (but please note we only used the pisec-total qrels set for evaluation). These can be used, for example, to analyse the results on a query-by-query basis, produce precision-recall graphs etc. If you require tools to process the trec_eval output, contact Paul Clough, who may have something to help you.
   
isec-rel isec-total
pisec-rel pisec-total
union-rel union-total


What to do next

You can continue experimenting with your systems using the qrels supplied (make sure you use the pisec-total qrels file), but please be ready to submit a paper to Carol Peters at CLEF by 15th August. If you are familiar with trec_eval, you should be able to use the qrels files in further system evaluation; otherwise, you can analyse and use the trec_eval output supplied by us and given in the above table.


Thanks and acknowledgements

We thank everyone who participated in ImageCLEF 2004 for making this such an interesting and successful evaluation. In particular we thank St Andrews University Library (esp. Norman Reid) for letting us use their collection. The relevance assessments are what make this evaluation possible, and we want to thank Hideo Joho, Simon Tucker, Mark Sanderson, Steve Whittaker, Wim Peters, Diego Uribe and Horacio Saggion.


My thanks also go out to those people involved in translating the captions, including Jian-Yun Nie, Jesper Kallehauge, Assad Alberair, Heidi Christensen, Xiao Mang Shou, Michael Bonn, Maarten de Rijke, Henning Mueller, Diego Uribe, Jussi Karlgren, Carol Peters, Eija Airio, Natalia Loukachevitch and Hideo Joho.
 

Page Maintained by Paul Clough

© University of Sheffield 2004