DOI: 10.1145/3340531.3411915
Research article

Offline Evaluation by Maximum Similarity to an Ideal Ranking

Published: 19 October 2020

Abstract

NDCG and similar measures remain standard for the offline evaluation of search, recommendation, question answering and similar systems. These measures require definitions for two or more relevance levels, which human assessors then apply to judge individual documents. Due to this dependence on a definition of relevance, it can be difficult to extend these measures to account for factors beyond relevance. Rather than propose extensions to these measures, we instead propose a radical simplification to replace them. For each query, we define a set of ideal rankings and compute the maximum rank similarity between members of this set and an actual ranking generated by a system. This maximum similarity to an ideal ranking becomes our effectiveness measure, replacing NDCG and similar measures. We propose rank biased overlap (RBO) to compute this rank similarity, since it was specifically created to address the requirements of rank similarity between search results. As examples, we explore ideal rankings that account for document length, diversity, and correctness.
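
As a concrete illustration of the proposed measure, the sketch below (Python; not the authors' released code) scores a system ranking by its maximum rank similarity to a set of ideal rankings, using a truncated form of rank-biased overlap (RBO). The function names, the persistence parameter p = 0.9, and the document identifiers are illustrative assumptions, and the sketch omits the extrapolation term that full RBO adds for the unseen tails of the rankings.

def rbo_truncated(run, ideal, p=0.9, depth=None):
    """Truncated RBO between two rankings (lists of document ids).

    Computes (1 - p) * sum_{d=1..k} p^(d-1) * A_d, where A_d is the size of
    the overlap of the two depth-d prefixes divided by d. Truncating the sum
    at depth k yields a lower bound on full RBO (the residual is ignored).
    """
    k = depth or min(len(run), len(ideal))
    seen_run, seen_ideal = set(), set()
    overlap, score = 0, 0.0
    for d in range(1, k + 1):
        x, y = run[d - 1], ideal[d - 1]
        if x == y:
            overlap += 1  # the same document enters both prefixes
        else:
            overlap += (x in seen_ideal) + (y in seen_run)
        seen_run.add(x)
        seen_ideal.add(y)
        score += (p ** (d - 1)) * (overlap / d)
    return (1 - p) * score

def max_similarity_to_ideal(run, ideal_rankings, p=0.9):
    """Effectiveness of a run = maximum RBO against any member of the ideal set."""
    return max(rbo_truncated(run, ideal, p) for ideal in ideal_rankings)

# Hypothetical example: two equally ideal orderings of the relevant documents;
# the run is scored against both and the larger similarity is reported.
ideals = [["d3", "d1", "d7"], ["d1", "d3", "d7"]]
run = ["d1", "d3", "d9", "d7"]
print(max_similarity_to_ideal(run, ideals))

Because the score depends only on the ideal rankings, factors such as document length, diversity, or correctness can be reflected by changing how the ideal set is constructed, rather than by redefining relevance grades or gain values.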

Supplementary Material

MP4 File (3340531.3411915.mp4)
Offline evaluation by maximum similarity to an ideal ranking




    Published In

    CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management
    October 2020, 3619 pages
    ISBN: 9781450368599
    DOI: 10.1145/3340531

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. NDCG
    2. decision support
    3. diversity
    4. document length normalization
    5. preference judgments
    6. rank similarity
    7. search evaluation

    Qualifiers

    • Research-article

    Conference

    CIKM '20
    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Cited By

    • (2024) Online Health Search Via Multidimensional Information Quality Assessment Based on Deep Language Models: Algorithm Development and Validation. JMIR AI, 3 (e42630). DOI: 10.2196/42630. Online publication date: 2-May-2024
    • (2024) How do Ties Affect the Uncertainty in Rank-Biased Overlap? In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, 125-134. DOI: 10.1145/3673791.3698422. Online publication date: 8-Dec-2024
    • (2024) The Treatment of Ties in Rank-Biased Overlap. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 251-260. DOI: 10.1145/3626772.3657700. Online publication date: 10-Jul-2024
    • (2024) Towards Automated End-to-End Health Misinformation Free Search with a Large Language Model. In Advances in Information Retrieval, 78-86. DOI: 10.1007/978-3-031-56066-8_9. Online publication date: 15-Mar-2024
    • (2023) How Many Crowd Workers Do I Need? On Statistical Power when Crowdsourcing Relevance Judgments. ACM Transactions on Information Systems, 42:1, 1-26. DOI: 10.1145/3597201. Online publication date: 22-May-2023
    • (2023) A Preference Judgment Tool for Authoritative Assessment. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 3100-3104. DOI: 10.1145/3539618.3591801. Online publication date: 19-Jul-2023
    • (2023) Preference-Based Offline Evaluation. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, 1248-1251. DOI: 10.1145/3539597.3572725. Online publication date: 27-Feb-2023
    • (2022) Preferences on a Budget: Prioritizing Document Pairs when Crowdsourcing Relevance Judgments. In Proceedings of the ACM Web Conference 2022, 319-327. DOI: 10.1145/3485447.3511960. Online publication date: 25-Apr-2022
    • (2022) Ranking Interruptus. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 588-598. DOI: 10.1145/3477495.3532051. Online publication date: 6-Jul-2022
    • (2022) A multistage retrieval system for health-related misinformation detection. Engineering Applications of Artificial Intelligence, 115 (105211). DOI: 10.1016/j.engappai.2022.105211. Online publication date: Oct-2022
    • Show More Cited By
