Abstract
Evaluating search system effectiveness is a foundational hallmark of information retrieval research. Doing so requires infrastructure appropriate for the task at hand, which generally follows the Cranfield paradigm: test collections and associated evaluation measures. A primary purpose of Information Retrieval (IR) evaluation campaigns such as the Text REtrieval Conference (TREC) and the Conference and Labs of the Evaluation Forum (CLEF) is to build this infrastructure. The first TREC collections targeted the same task as the original Cranfield tests and used measures that were familiar to test collection users of the time. But as evaluation tasks have multiplied and diversified, test collection construction techniques and evaluation measure definitions have also been forced to evolve. This chapter examines how the Cranfield paradigm has been adapted to meet the changing requirements for search systems, enabling it to continue to support a vibrant research community.
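As a hedged illustration of the Cranfield-style batch evaluation the abstract refers to (a sketch, not an excerpt from the chapter), the Python snippet below scores a ranked run against a test collection's relevance judgments (qrels) using average precision; all topic and document identifiers are hypothetical.

# Minimal sketch of Cranfield-style batch evaluation (illustration only,
# not code from the chapter). A test collection supplies topics and relevance
# judgments (qrels); a system is scored by an evaluation measure over its
# ranked run. Average precision (AP) and its mean over topics (MAP) serve as
# one standard example measure; the identifiers below are hypothetical.

def average_precision(ranking, relevant):
    """AP for one topic: average of precision at each rank holding a relevant doc."""
    hits, prec_sum = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            prec_sum += hits / rank
    return prec_sum / len(relevant) if relevant else 0.0

# Toy qrels (topic -> set of relevant doc ids) and run (topic -> ranked doc ids).
qrels = {"topic-1": {"d3", "d7"}, "topic-2": {"d1"}}
run = {"topic-1": ["d7", "d2", "d3"], "topic-2": ["d5", "d1"]}

# Mean average precision (MAP) over the topic set.
map_score = sum(average_precision(run[t], qrels[t]) for t in qrels) / len(qrels)
print(f"MAP = {map_score:.3f}")  # prints 0.667 for this toy data

In practice such scores are computed by standard tooling (e.g., trec_eval) over full runs and pooled judgments; the toy computation above only indicates the shape of the test-collection abstraction.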
Copyright information
© 2019 This is a U.S. government work and not under copyright protection in the U.S.; foreign copyright protection may apply
About this chapter
Cite this chapter
Voorhees, E.M. (2019). The Evolution of Cranfield. In: Ferro, N., Peters, C. (eds) Information Retrieval Evaluation in a Changing World. The Information Retrieval Series, vol 41. Springer, Cham. https://doi.org/10.1007/978-3-030-22948-1_2
DOI: https://doi.org/10.1007/978-3-030-22948-1_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-22947-4
Online ISBN: 978-3-030-22948-1