DOI: 10.1145/1390334.1390445
Research article

Evaluation over thousands of queries

Published: 20 July 2008

Abstract

Information retrieval evaluation has typically been performed over several dozen queries, each judged to near-completeness. There has been a great deal of recent work on evaluation over much smaller judgment sets: how to select the best set of documents to judge and how to estimate evaluation measures when few judgments are available. In light of this, it should be possible to evaluate over many more queries without much more total judging effort. The Million Query Track at TREC 2007 used two document selection algorithms to acquire relevance judgments for more than 1,800 queries. We present results of the track, along with deeper analysis: investigating tradeoffs between the number of queries and number of judgments shows that, up to a point, evaluation over more queries with fewer judgments is more cost-effective and as reliable as fewer queries with more judgments. Total assessor effort can be reduced by 95% with no appreciable increase in evaluation errors.
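
The abstract's central claim is that, for a fixed total judging budget, spreading effort over many queries with few judgments each can rank systems as reliably as concentrating effort on a few heavily judged queries. The following toy Monte Carlo sketch illustrates that tradeoff; it is not the track's MTC or statAP estimators, and the score distribution, per-topic gap, and noise model (estimation error shrinking like 1/sqrt(judgments)) are illustrative assumptions only.

    import random
    import statistics

    # Toy model (illustrative assumptions, not the paper's estimators):
    # under a fixed total judging budget, how often does a comparison of two
    # hypothetical systems identify the truly better one?

    def correct_ordering_rate(num_queries, judgments_per_query,
                              mean_gap=0.03, gap_sd=0.1, trials=2000, seed=0):
        rng = random.Random(seed)
        # Per-query estimation noise assumed to shrink like 1/sqrt(n).
        noise_sd = 0.3 / judgments_per_query ** 0.5
        wins = 0
        for _ in range(trials):
            a_scores, b_scores = [], []
            for _ in range(num_queries):
                difficulty = rng.betavariate(2, 5)     # shared per-topic difficulty
                gap = rng.gauss(mean_gap, gap_sd)      # per-topic gap between systems
                a_scores.append(difficulty + rng.gauss(0, noise_sd))
                b_scores.append(difficulty + gap + rng.gauss(0, noise_sd))
            if statistics.mean(b_scores) > statistics.mean(a_scores):
                wins += 1
        return wins / trials

    budget = 10_000  # total judgments, split evenly across queries
    for q in (50, 200, 1000):
        n = budget // q
        rate = correct_ordering_rate(q, n)
        print(f"{q:>4} queries x {n:>3} judgments/query: "
              f"better system identified in {rate:.1%} of trials")

Under these assumptions, the many-query configurations separate the two systems at least as reliably as the 50-query configuration for the same total budget, mirroring the tradeoff the paper quantifies with real TREC judgments.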




    Published In

    SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
    July 2008, 934 pages
    ISBN: 9781605581644
    DOI: 10.1145/1390334

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. evaluation
    2. information retrieval
    3. million query track
    4. test collections


    Conference

    SIGIR '08

    Acceptance Rates

    Overall acceptance rate: 792 of 3,983 submissions, 20%

