DOI: 10.1145/1571941.1572029
research-article

Including summaries in system evaluation

Published: 19 July 2009

Abstract

In batch evaluation of retrieval systems, performance is calculated based on predetermined relevance judgements applied to a list of documents returned by the system for a query. This evaluation paradigm, however, ignores the standard operation of current search systems, which require the user to view summaries of documents prior to reading the documents themselves.
In this paper we modify the popular IR metrics MAP and P@10 to incorporate the summary reading step of the search process, and study the effects on system rankings using TREC data. Based on a user study, we establish likely disagreements between relevance judgements of summaries and of documents, and use these values to seed simulations of summary relevance in the TREC data. Re-evaluating the runs submitted to the TREC Web Track, we find that the average correlation between the resulting system rankings and the original TREC rankings is 0.8 (Kendall τ), lower than the threshold commonly accepted for two system orderings to be considered equivalent. The system with the highest MAP in TREC generally remains amongst the highest-MAP systems when summaries are taken into account, but, depending on the simulated summary relevance, other systems become equivalent to the top-ranked system.
Given that system orderings alter when summaries are taken into account, the small amount of effort required to judge summaries in addition to documents (19 seconds vs 88 seconds on average in our data) should be undertaken when constructing test collections.
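
To make the re-evaluation pipeline concrete, the sketch below works through a toy version of it in Python: summary judgements are simulated by flipping document judgements at fixed disagreement rates, a summary-aware P@10 counts only documents whose summaries would also be judged relevant, and the resulting system ordering is compared with the document-only ordering using Kendall τ. The 0.2/0.05 flip rates, the four invented runs, and the click-through rule are illustrative assumptions, not the paper's modified metrics or its measured disagreement values.

```python
# A minimal sketch (not the paper's exact method) of folding summary
# judgements into a precision metric and comparing system orderings with
# Kendall's tau. Disagreement rates and run data are invented.
import random
from scipy.stats import kendalltau


def p_at_10(rels):
    """Precision at 10: fraction of the top ten documents that are relevant."""
    return sum(rels[:10]) / 10.0


def simulate_summary_judgements(doc_rels, p_miss=0.2, p_false_alarm=0.05, rng=None):
    """Flip document judgements to mimic summary-level disagreement.

    p_miss: chance a relevant document's summary is judged non-relevant.
    p_false_alarm: chance a non-relevant document's summary looks relevant.
    Both rates are illustrative, not the values measured in the paper.
    """
    rng = rng or random.Random(42)
    flipped = []
    for rel in doc_rels:
        p_flip = p_miss if rel else p_false_alarm
        flipped.append(1 - rel if rng.random() < p_flip else rel)
    return flipped


def summary_aware_p_at_10(doc_rels, sum_rels):
    """Count a document as relevant only if its summary would also be judged
    relevant, i.e. the user would click through to it."""
    return p_at_10([d & s for d, s in zip(doc_rels, sum_rels)])


# Toy runs: four "systems", each a ranked list of 0/1 document judgements
# for one query. A real re-evaluation averages over many queries and runs.
runs = {
    "sysA": [1, 1, 1, 0, 1, 0, 0, 1, 0, 0],
    "sysB": [1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
    "sysC": [1, 0, 1, 0, 0, 1, 0, 0, 0, 0],
    "sysD": [0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
}

doc_scores = {s: p_at_10(r) for s, r in runs.items()}
sum_scores = {s: summary_aware_p_at_10(r, simulate_summary_judgements(r))
              for s, r in runs.items()}

# Kendall tau between the document-only and summary-aware system orderings.
systems = sorted(runs)
tau, _ = kendalltau([doc_scores[s] for s in systems],
                    [sum_scores[s] for s in systems])
print("document-only P@10:", doc_scores)
print("summary-aware P@10:", sum_scores)
print("Kendall tau between the two orderings: %.2f" % tau)
```

Run over real TREC runs and queries, with measured disagreement rates in place of the invented ones, the same comparison produces the kind of τ values reported above.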



    Published In

    SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
    July 2009
    896 pages
    ISBN: 9781605584836
    DOI: 10.1145/1571941

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. ir evaluation
    2. summaries
    3. trec

    Qualifiers

    • Research-article

    Conference

    SIGIR '09

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%


    Cited By

    • (2019) Holes in the Outline. Proceedings of the 2019 Conference on Human Information Interaction and Retrieval, 289-293. DOI: 10.1145/3295750.3298953. Online publication date: 8-Mar-2019
    • (2018) Better Effectiveness Metrics for SERPs, Cards, and Rankings. Proceedings of the 23rd Australasian Document Computing Symposium, 1-8. DOI: 10.1145/3291992.3292002. Online publication date: 11-Dec-2018
    • (2018) An Axiomatic Analysis of Diversity Evaluation Metrics. The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 625-634. DOI: 10.1145/3209978.3210024. Online publication date: 27-Jun-2018
    • (2018) Task-oriented search for evidence-based medicine. International Journal on Digital Libraries, 19(2-3), 217-229. DOI: 10.1007/s00799-017-0209-7. Online publication date: 1-Sep-2018
    • (2017) Incorporating User Expectations and Behavior into the Measurement of Search Effectiveness. ACM Transactions on Information Systems, 35(3), 1-38. DOI: 10.1145/3052768. Online publication date: 5-Jun-2017
    • (2017) Validating simulated interaction for retrieval evaluation. Information Retrieval Journal, 20(4), 338-362. DOI: 10.1007/s10791-017-9301-2. Online publication date: 6-May-2017
    • (2017) Extracting audio summaries to support effective spoken document search. Journal of the Association for Information Science and Technology, 68(9), 2101-2115. DOI: 10.1002/asi.23831. Online publication date: 1-Sep-2017
    • (2016) Generating Personalized Snippets for Web Page Recommender Systems. Transactions of the Japanese Society for Artificial Intelligence, 31(5), C-G41_1-12. DOI: 10.1527/tjsai.C-G41. Online publication date: 2016
    • (2016) Incorporating Clicks, Attention and Satisfaction into a Search Engine Result Page Evaluation Model. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, 175-184. DOI: 10.1145/2983323.2983829. Online publication date: 24-Oct-2016
    • (2016) Predicting relevance based on assessor disagreement: analysis and practical applications for search evaluation. Information Retrieval, 19(3), 284-312. DOI: 10.1007/s10791-015-9275-x. Online publication date: 1-Jun-2016
    • Show More Cited By
