DOI: 10.1145/3340531.3411915
Research article

Offline Evaluation by Maximum Similarity to an Ideal Ranking

Published: 19 October 2020

Abstract

NDCG and similar measures remain standard for the offline evaluation of search, recommendation, question answering and similar systems. These measures require definitions for two or more relevance levels, which human assessors then apply to judge individual documents. Due to this dependence on a definition of relevance, it can be difficult to extend these measures to account for factors beyond relevance. Rather than propose extensions to these measures, we instead propose a radical simplification to replace them. For each query, we define a set of ideal rankings and compute the maximum rank similarity between members of this set and an actual ranking generated by a system. This maximum similarity to an ideal ranking becomes our effectiveness measure, replacing NDCG and similar measures. We propose rank biased overlap (RBO) to compute this rank similarity, since it was specifically created to address the requirements of rank similarity between search results. As examples, we explore ideal rankings that account for document length, diversity, and correctness.
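
As a concrete illustration of the proposed measure, the sketch below (Python; not the authors' released code) scores a system ranking by its maximum rank similarity to a set of ideal rankings, using a truncated form of rank-biased overlap (RBO). The function names, the persistence parameter p = 0.9, and the document identifiers are illustrative assumptions, and the sketch omits the extrapolation term that full RBO adds for the unseen tails of the rankings.

def rbo_truncated(run, ideal, p=0.9, depth=None):
    """Truncated RBO between two rankings (lists of document ids).

    Computes (1 - p) * sum_{d=1..k} p^(d-1) * A_d, where A_d is the size of
    the overlap of the two depth-d prefixes divided by d. Truncating the sum
    at depth k yields a lower bound on full RBO (the residual is ignored).
    """
    k = depth or min(len(run), len(ideal))
    seen_run, seen_ideal = set(), set()
    overlap, score = 0, 0.0
    for d in range(1, k + 1):
        x, y = run[d - 1], ideal[d - 1]
        if x == y:
            overlap += 1  # the same document enters both prefixes
        else:
            overlap += (x in seen_ideal) + (y in seen_run)
        seen_run.add(x)
        seen_ideal.add(y)
        score += (p ** (d - 1)) * (overlap / d)
    return (1 - p) * score

def max_similarity_to_ideal(run, ideal_rankings, p=0.9):
    """Effectiveness of a run = maximum RBO against any member of the ideal set."""
    return max(rbo_truncated(run, ideal, p) for ideal in ideal_rankings)

# Hypothetical example: two equally ideal orderings of the relevant documents;
# the run is scored against both and the larger similarity is reported.
ideals = [["d3", "d1", "d7"], ["d1", "d3", "d7"]]
run = ["d1", "d3", "d9", "d7"]
print(max_similarity_to_ideal(run, ideals))

Because the score depends only on the ideal rankings, factors such as document length, diversity, or correctness can be reflected by changing how the ideal set is constructed, rather than by redefining relevance grades or gain values.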

Supplementary Material

MP4 File (3340531.3411915.mp4)
Offline evaluation by maximum similarity to an ideal ranking




    Published In

    CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management
    October 2020, 3619 pages
    ISBN: 9781450368599
    DOI: 10.1145/3340531

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. NDCG
    2. decision support
    3. diversity
    4. document length normalization
    5. preference judgments
    6. rank similarity
    7. search evaluation

    Qualifiers

    • Research-article

    Conference

    CIKM '20
    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Cited By

    • (2024) Online Health Search Via Multidimensional Information Quality Assessment Based on Deep Language Models: Algorithm Development and Validation. JMIR AI, 3 (e42630). DOI: 10.2196/42630. Online publication date: 2-May-2024
    • (2024) How do Ties Affect the Uncertainty in Rank-Biased Overlap? In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, 125-134. DOI: 10.1145/3673791.3698422. Online publication date: 8-Dec-2024
    • (2024) The Treatment of Ties in Rank-Biased Overlap. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 251-260. DOI: 10.1145/3626772.3657700. Online publication date: 10-Jul-2024
    • (2024) Towards Automated End-to-End Health Misinformation Free Search with a Large Language Model. In Advances in Information Retrieval, 78-86. DOI: 10.1007/978-3-031-56066-8_9. Online publication date: 15-Mar-2024
    • (2023) How Many Crowd Workers Do I Need? On Statistical Power when Crowdsourcing Relevance Judgments. ACM Transactions on Information Systems, 42:1, 1-26. DOI: 10.1145/3597201. Online publication date: 22-May-2023
    • (2023) A Preference Judgment Tool for Authoritative Assessment. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 3100-3104. DOI: 10.1145/3539618.3591801. Online publication date: 19-Jul-2023
    • (2023) Preference-Based Offline Evaluation. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, 1248-1251. DOI: 10.1145/3539597.3572725. Online publication date: 27-Feb-2023
    • (2022) Preferences on a Budget: Prioritizing Document Pairs when Crowdsourcing Relevance Judgments. In Proceedings of the ACM Web Conference 2022, 319-327. DOI: 10.1145/3485447.3511960. Online publication date: 25-Apr-2022
    • (2022) Ranking Interruptus. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 588-598. DOI: 10.1145/3477495.3532051. Online publication date: 6-Jul-2022
    • (2022) A multistage retrieval system for health-related misinformation detection. Engineering Applications of Artificial Intelligence, 115 (105211). DOI: 10.1016/j.engappai.2022.105211. Online publication date: Oct-2022
    • Show More Cited By
