DOI: 10.1145/3077136.3080804

Meta-evaluation of Online and Offline Web Search Evaluation Metrics

Published: 07 August 2017

Abstract

As in most information retrieval (IR) studies, evaluation plays an essential part in Web search research. Both offline and online evaluation metrics are adopted to measure the performance of search engines. Offline metrics are usually based on relevance judgments of query-document pairs from assessors, while online metrics exploit user behavior data, such as clicks, collected by search engines to compare search algorithms. Although both types of IR evaluation metrics have achieved success, the extent to which they can predict user satisfaction remains under-investigated. To shed light on this research question, we meta-evaluate a series of existing online and offline metrics to study how well they infer actual search user satisfaction in different search scenarios. We find that both types of evaluation metrics significantly correlate with user satisfaction, while they reflect satisfaction from different perspectives for different search tasks. Offline metrics align better with user satisfaction in homogeneous search (i.e., ten blue links), whereas online metrics perform better when vertical results are federated. Finally, we also propose to incorporate mouse hover information into existing online evaluation metrics, and empirically show that the hover-augmented metrics align better with search user satisfaction than click-based online metrics.
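To make the meta-evaluation setup concrete, the sketch below computes one offline metric (DCG over relevance judgments) and one simple click-based online signal for the same set of sessions, then correlates each with explicit satisfaction labels using the Pearson coefficient. This is an illustrative Python sketch, not the authors' code; the `sessions` data, the DCG@5 cutoff, the reciprocal-rank click signal, and the 5-point satisfaction scale are assumptions made for the example.

```python
import math
from statistics import mean


def dcg_at_k(relevances, k=5):
    """Offline metric: DCG over graded relevance judgments of the top-k results."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def first_click_rr(click_ranks):
    """Online signal: reciprocal rank of the earliest click (0 for abandoned queries)."""
    return 1.0 / (min(click_ranks) + 1) if click_ranks else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical query sessions: graded relevance (0-3) of the ranked results,
# the 0-based ranks that were clicked, and a 5-point satisfaction label.
sessions = [
    {"rels": [3, 2, 0, 1, 0], "clicks": [0],    "sat": 5},
    {"rels": [2, 3, 1, 0, 0], "clicks": [1],    "sat": 4},
    {"rels": [1, 0, 2, 0, 0], "clicks": [2, 3], "sat": 3},
    {"rels": [0, 0, 1, 0, 0], "clicks": [],     "sat": 1},
]

satisfaction = [s["sat"] for s in sessions]
offline_scores = [dcg_at_k(s["rels"]) for s in sessions]
online_scores = [first_click_rr(s["clicks"]) for s in sessions]

# Meta-evaluation: how well does each metric track satisfaction across sessions?
print("DCG@5   vs. satisfaction:", round(pearson(offline_scores, satisfaction), 3))
print("ClickRR vs. satisfaction:", round(pearson(online_scores, satisfaction), 3))
```

A hover-augmented online metric of the kind the paper proposes would slot into the same pipeline, replacing or supplementing the click signal before the correlation step.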


Published In

SIGIR '17: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval
August 2017
1476 pages
ISBN:9781450350228
DOI:10.1145/3077136

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 August 2017


Author Tags

  1. evaluation metrics
  2. online evaluation
  3. search satisfaction

Qualifiers

  • Research-article

Funding Sources

  • Tsinghua University Initiative Scientific Research Program
  • National Key Basic Research Program
  • Natural Science Foundation of China

Conference

SIGIR '17

Acceptance Rates

SIGIR '17 paper acceptance rate: 78 of 362 submissions, 22%
Overall acceptance rate: 792 of 3,983 submissions, 20%


Cited By

  • (2024) What Matters in a Measure? A Perspective from Large-Scale Search Evaluation. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 282-292. DOI: 10.1145/3626772.3657845. Online publication date: 10-Jul-2024.
  • (2024) How much freedom does an effectiveness metric really have? Journal of the Association for Information Science and Technology. DOI: 10.1002/asi.24874. Online publication date: 15-Feb-2024.
  • (2023) Report on the Dagstuhl Seminar on Frontiers of Information Access Experimentation for Research and Education. ACM SIGIR Forum, 57(1), 1-28. DOI: 10.1145/3636341.3636351. Online publication date: 1-Jun-2023.
  • (2023) An Intent Taxonomy of Legal Case Retrieval. ACM Transactions on Information Systems, 42(2), 1-27. DOI: 10.1145/3626093. Online publication date: 29-Sep-2023.
  • (2023) Validating Synthetic Usage Data in Living Lab Environments. Journal of Data and Information Quality. DOI: 10.1145/3623640. Online publication date: 24-Sep-2023.
  • (2023) On the Reliability of User Feedback for Evaluating the Quality of Conversational Agents. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 4185-4189. DOI: 10.1145/3583780.3615286. Online publication date: 21-Oct-2023.
  • (2023) A Reference-Dependent Model for Web Search Evaluation. Proceedings of the ACM Web Conference 2023, 3396-3405. DOI: 10.1145/3543507.3583551. Online publication date: 30-Apr-2023.
  • (2023) User Behavior Simulation for Search Result Re-ranking. ACM Transactions on Information Systems, 41(1), 1-35. DOI: 10.1145/3511469. Online publication date: 20-Jan-2023.
  • (2023) Constructing and meta-evaluating state-aware evaluation metrics for interactive search systems. Information Retrieval Journal, 26(1-2). DOI: 10.1007/s10791-023-09426-1. Online publication date: 31-Oct-2023.
  • (2023) Back to the Fundamentals: Extend the Rational Assumptions. A Behavioral Economics Approach to Interactive Information Retrieval, 131-152. DOI: 10.1007/978-3-031-23229-9_5. Online publication date: 18-Feb-2023.