DOI: 10.1145/3077136.3080804

Meta-evaluation of Online and Offline Web Search Evaluation Metrics

Published: 07 August 2017

Abstract

As in most information retrieval (IR) studies, evaluation plays an essential part in Web search research. Both offline and online evaluation metrics are adopted to measure the performance of search engines. Offline metrics are usually based on relevance judgments of query-document pairs from assessors, while online metrics exploit user behavior data, such as clicks, collected by search engines to compare search algorithms. Although both types of IR evaluation metrics have achieved success, the extent to which they can predict user satisfaction remains under-investigated. To shed light on this research question, we meta-evaluate a series of existing online and offline metrics to study how well they infer actual search user satisfaction in different search scenarios. We find that both types of evaluation metrics significantly correlate with user satisfaction, while they reflect satisfaction from different perspectives for different search tasks. Offline metrics align better with user satisfaction in homogeneous search (i.e., ten blue links), whereas online metrics perform better when vertical results are federated. Finally, we also propose to incorporate mouse hover information into existing online evaluation metrics, and empirically show that the hover-augmented metrics align better with search user satisfaction than click-based online metrics.
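To make the meta-evaluation setup concrete, the sketch below computes one offline metric (DCG over relevance judgments) and one simple click-based online signal for the same set of sessions, then correlates each with explicit satisfaction labels using the Pearson coefficient. This is an illustrative Python sketch, not the authors' code; the `sessions` data, the DCG@5 cutoff, the reciprocal-rank click signal, and the 5-point satisfaction scale are assumptions made for the example.

```python
import math
from statistics import mean


def dcg_at_k(relevances, k=5):
    """Offline metric: DCG over graded relevance judgments of the top-k results."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def first_click_rr(click_ranks):
    """Online signal: reciprocal rank of the earliest click (0 for abandoned queries)."""
    return 1.0 / (min(click_ranks) + 1) if click_ranks else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical query sessions: graded relevance (0-3) of the ranked results,
# the 0-based ranks that were clicked, and a 5-point satisfaction label.
sessions = [
    {"rels": [3, 2, 0, 1, 0], "clicks": [0],    "sat": 5},
    {"rels": [2, 3, 1, 0, 0], "clicks": [1],    "sat": 4},
    {"rels": [1, 0, 2, 0, 0], "clicks": [2, 3], "sat": 3},
    {"rels": [0, 0, 1, 0, 0], "clicks": [],     "sat": 1},
]

satisfaction = [s["sat"] for s in sessions]
offline_scores = [dcg_at_k(s["rels"]) for s in sessions]
online_scores = [first_click_rr(s["clicks"]) for s in sessions]

# Meta-evaluation: how well does each metric track satisfaction across sessions?
print("DCG@5   vs. satisfaction:", round(pearson(offline_scores, satisfaction), 3))
print("ClickRR vs. satisfaction:", round(pearson(online_scores, satisfaction), 3))
```

A hover-augmented online metric of the kind the paper proposes would slot into the same pipeline, replacing or supplementing the click signal before the correlation step.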


Published In

SIGIR '17: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval
August 2017
1476 pages
ISBN:9781450350228
DOI:10.1145/3077136

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 August 2017


Author Tags

  1. evaluation metrics
  2. online evaluation
  3. search satisfaction

Qualifiers

  • Research-article

Funding Sources

  • Tsinghua University Initiative Scientific Research Program
  • National Key Basic Research Program
  • Natural Science Foundation of China

Conference

SIGIR '17

Acceptance Rates

SIGIR '17 paper acceptance rate: 78 of 362 submissions, 22%
Overall acceptance rate: 792 of 3,983 submissions, 20%


Cited By

  • (2024) What Matters in a Measure? A Perspective from Large-Scale Search Evaluation. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 282-292. DOI: 10.1145/3626772.3657845. Online publication date: 10-Jul-2024.
  • (2024) How much freedom does an effectiveness metric really have? Journal of the Association for Information Science and Technology. DOI: 10.1002/asi.24874. Online publication date: 15-Feb-2024.
  • (2023) Report on the Dagstuhl Seminar on Frontiers of Information Access Experimentation for Research and Education. ACM SIGIR Forum, 57(1), 1-28. DOI: 10.1145/3636341.3636351. Online publication date: 1-Jun-2023.
  • (2023) An Intent Taxonomy of Legal Case Retrieval. ACM Transactions on Information Systems, 42(2), 1-27. DOI: 10.1145/3626093. Online publication date: 29-Sep-2023.
  • (2023) Validating Synthetic Usage Data in Living Lab Environments. Journal of Data and Information Quality. DOI: 10.1145/3623640. Online publication date: 24-Sep-2023.
  • (2023) On the Reliability of User Feedback for Evaluating the Quality of Conversational Agents. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 4185-4189. DOI: 10.1145/3583780.3615286. Online publication date: 21-Oct-2023.
  • (2023) A Reference-Dependent Model for Web Search Evaluation. Proceedings of the ACM Web Conference 2023, 3396-3405. DOI: 10.1145/3543507.3583551. Online publication date: 30-Apr-2023.
  • (2023) User Behavior Simulation for Search Result Re-ranking. ACM Transactions on Information Systems, 41(1), 1-35. DOI: 10.1145/3511469. Online publication date: 20-Jan-2023.
  • (2023) Constructing and meta-evaluating state-aware evaluation metrics for interactive search systems. Information Retrieval Journal, 26(1-2). DOI: 10.1007/s10791-023-09426-1. Online publication date: 31-Oct-2023.
  • (2023) Back to the Fundamentals: Extend the Rational Assumptions. A Behavioral Economics Approach to Interactive Information Retrieval, 131-152. DOI: 10.1007/978-3-031-23229-9_5. Online publication date: 18-Feb-2023.