DOI: 10.1145/3404835.3462784
Short paper · Open access

DiffIR: Exploring Differences in Ranking Models' Behavior

Published: 11 July 2021

Abstract

Understanding and comparing the behavior of retrieval models is a fundamental challenge that requires going beyond examining average effectiveness and per-query metrics, because these do not reveal key differences in how ranking models' behavior impacts individual results. DiffIR is a new open-source web tool to assist with qualitative ranking analysis by visually 'diffing' system rankings at the individual result level for queries where behavior significantly diverges. Using one of several configurable similarity measures, it identifies queries for which the compared models' rankings differ substantially and provides a visual web interface to examine the rankings side-by-side. DiffIR additionally supports a model-specific visualization approach based on custom term importance weight files. These files support studying the behavior of interpretable models, such as neural retrieval methods that produce document scores based on a similarity matrix or on a single document passage. Observations from this tool can complement neural probing approaches like ABNIRML to generate quantitative tests. We provide an illustrative use case of DiffIR by studying the qualitative differences between recently developed neural ranking models on a standard TREC benchmark dataset.
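To make the query-selection step concrete, the sketch below illustrates how one might flag queries whose top-ranked results differ substantially between two systems, which is the kind of divergence DiffIR surfaces for side-by-side inspection. This is not DiffIR's actual API: the function names, the simple top-k overlap measure, and the toy run data are illustrative assumptions; a configured measure such as a weighted rank correlation would take the place of the overlap used here.

# Minimal sketch of the divergence-detection idea described in the abstract.
# NOT DiffIR's actual API: names, the top-k overlap measure, and the toy data
# below are illustrative placeholders.

def top_k_agreement(ranking_a, ranking_b, k=10):
    """Fraction of documents shared by the two top-k lists (1.0 = identical sets)."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k

def diverging_queries(run_a, run_b, k=10, threshold=0.5):
    """Return (query id, similarity) pairs for queries whose top-k results differ
    substantially between two runs, most divergent first.

    run_a / run_b: dict mapping query id -> list of doc ids sorted by score.
    """
    scored = [(qid, top_k_agreement(run_a[qid], run_b[qid], k))
              for qid in run_a.keys() & run_b.keys()]
    return sorted([(qid, sim) for qid, sim in scored if sim < threshold],
                  key=lambda pair: pair[1])

# Toy example: the two systems agree on "q2" but share only 1 of 3 top documents on "q1".
run_bm25 = {"q1": ["d1", "d2", "d3"], "q2": ["d7", "d8", "d9"]}
run_bert = {"q1": ["d4", "d5", "d1"], "q2": ["d7", "d8", "d9"]}
print(diverging_queries(run_bm25, run_bert, k=3, threshold=0.8))  # [('q1', 0.333...)]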

References

[1]
Alfred V Aho and Margaret J Corasick. 1975. Efficient string matching: an aid to bibliographic search. Commun. ACM, Vol. 18, 6 (1975), 333--340.
[2]
Zeynep Akkalyoncu Yilmaz, Wei Yang, Haotian Zhang, and Jimmy Lin. 2019. Cross-Domain Modeling of Sentence-Level Evidence for Document Retrieval. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China, 3490--3496.
[3]
Azzah Al-Maskari, Mark Sanderson, Paul Clough, and Eija Airio. 2008. The good and the bad system: does the test collection predict users' effectiveness?. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval. 59--66.
[4]
Peter Bailey, Nick Craswell, Ryen W White, Liwei Chen, Ashwin Satyanarayana, and Seyed MM Tahaghoghi. 2010. Evaluating search systems using result page context. In Proceedings of the third symposium on Information interaction in context.
[5]
Ben Carterette, James Allan, and Ramesh Sitaraman. 2006. Minimal Test Collections for Retrieval Evaluation. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Seattle, Washington, USA) (SIGIR '06). Association for Computing Machinery, New York, NY, USA, 268--275. https://doi.org/10.1145/1148170.1148219
[6]
Ben Carterette, Paul N. Bennett, David Maxwell Chickering, and Susan T. Dumais. 2008. Here or There: Preference Judgments for Relevance. In Proceedings of the IR Research, 30th European Conference on Advances in Information Retrieval (Glasgow, UK) (ECIR'08). Springer-Verlag, Berlin, Heidelberg, 16--27.
[7]
Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. In Proceedings of the 8th International Conference on Learning Representations (ICLR 2020).
[8]
Zhuyun Dai and Jamie Callan. 2019a. Context-Aware Sentence/Passage Term Importance Estimation For First Stage Retrieval. arXiv:1910.10687 (2019).
[9]
Zhuyun Dai and Jamie Callan. 2019b. Deeper Text Understanding for IR with Contextual Neural Language Modeling. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval.
[10]
Hui Fang, Tao Tao, and Chengxiang Zhai. 2011. Diagnostic evaluation of information retrieval models. ACM Transactions on Information Systems (TOIS), Vol. 29, 2 (2011), 1--42.
[11]
Marco Ferrante, Nicola Ferro, and Norbert Fuhr. 2021. Towards Meaningful Statements in IR Evaluation. Mapping Evaluation Measures to Interval Scales. arXiv preprint arXiv:2101.02668 (2021).
[12]
N. Fuhr. 2018. Some Common Mistakes In IR Evaluation, And How They Can Be Avoided. SIGIR Forum, Vol. 51 (2018), 32--41.
[13]
Ning Gao, Mossaab Bagdouri, and Douglas W Oard. 2016. Pearson rank: a head-weighted gap-sensitive score-based correlation coefficient. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. 941--944.
[14]
Ning Gao and Douglas Oard. 2015. A head-weighted gap-sensitive correlation coefficient. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. 799--802.
[15]
Peter B Golbus, Imed Zitouni, Jin Young Kim, Ahmed Hassan, and Fernando Diaz. 2014. Contextual and dimensional relevance judgments for reusable SERP-level evaluation. In Proceedings of the 23rd international conference on World wide web.
[16]
John Guiver, Stefano Mizzaro, and Stephen Robertson. 2009. A few good topics: Experiments in topic set reduction for retrieval evaluation. ACM Transactions on Information Systems (TOIS), Vol. 27, 4 (2009), 1--26.
[17]
Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. 2016. A Deep Relevance Matching Model for Ad-Hoc Retrieval. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management (CIKM 2016). Indianapolis, Indiana, 55--64.
[18]
Sebastian Hofstätter, Markus Zlabinger, and Allan Hanbury. 2020. Interpretable & Time-Budget-Constrained Contextualization for Re-Ranking. In Proceedings of the 24th European Conference on Artificial Intelligence (ECAI 2020).
[19]
Sebastian Hofstätter, Markus Zlabinger, and Allan Hanbury. 2019a. Neural-IR-Explorer: A Content-Focused Tool to Explore Neural Re-Ranking Results. arXiv:1912.04713 [cs.IR]
[20]
Sebastian Hofstätter, Markus Zlabinger, and Allan Hanbury. 2019b. TU Wien @ TREC Deep Learning '19 -- Simple Contextualization for Re-ranking. arXiv:1912.01385 [cs.IR]
[21]
Mehdi Hosseini, Ingemar J Cox, Natasa Milic-Frayling, Vishwa Vinay, and Trevor Sweeting. 2011. Selecting a subset of queries for acquisition of further relevance judgements. In Conference on the theory of information retrieval. Springer.
[22]
Kai Hui, Andrew Yates, Klaus Berberich, and Gerard De Melo. 2018. Co-PACRR: A context-aware neural IR model for ad-hoc retrieval. In Proceedings of the eleventh ACM international conference on web search and data mining. 279--287.
[23]
Gabriella Kazai and Homer Sung. 2014. Dissimilarity based query selection for efficient preference based IR evaluation. In European conference on information retrieval. Springer, 172--183.
[24]
Mucahid Kutlu, Tamer Elsayed, Maram Hasanain, and Matthew Lease. 2018. When rank order isn't enough: New statistical-significance-aware correlation measures. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 397--406.
[25]
Mucahid Kutlu, Tamer Elsayed, and Matthew Lease. 2017. Learning to effectively select topics for information retrieval test collections. arXiv:1701.07810 (2017).
[26]
Canjia Li, Andrew Yates, Sean MacAvaney, Ben He, and Yingfei Sun. 2020. PARADE: Passage representation aggregation for document reranking. arXiv preprint arXiv:2008.09093 (2020).
[27]
Jimmy Lin, Rodrigo Nogueira, and Andrew Yates. 2020. Pretrained Transformers for Text Ranking: BERT and Beyond. arXiv:2010.06467 (2020).
[28]
Sean MacAvaney. 2020. OpenNIR: A Complete Neural Ad-Hoc Ranking Pipeline. In Proceedings of the Thirteenth ACM International Conference on Web Search and Data Mining. 845--848. https://doi.org/10.1145/3336191.3371864
[29]
Sean MacAvaney, Sergey Feldman, Nazli Goharian, Doug Downey, and Arman Cohan. 2020a. ABNIRML: Analyzing the Behavior of Neural IR Models. arXiv:2011.00696 (2020).
[30]
Sean MacAvaney, Franco Maria Nardini, Raffaele Perego, Nicola Tonellotto, Nazli Goharian, and Ophir Frieder. 2020b. Expansion via Prediction of Importance with Contextualization. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020) (Virtual Event, China). 1573--1576.
[31]
Sean MacAvaney, Andrew Yates, Arman Cohan, and Nazli Goharian. 2019. CEDR: Contextualized Embeddings for Document Ranking. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1101--1104. https://doi.org/10.1145/3331184.3331317
[32]
Sean MacAvaney, Andrew Yates, Sergey Feldman, Doug Downey, Arman Cohan, and Nazli Goharian. 2021. Simplified Data Wrangling with ir_datasets. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. https://doi.org/10.1145/3404835.3463254
[33]
C. MacDonald and N. Tonellotto. 2020. Declarative Experimentation in Information Retrieval using PyTerrier. Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval (2020).
[34]
Bhaskar Mitra, Corby Rosset, David Hawking, Nick Craswell, Fernando Diaz, and Emine Yilmaz. 2019. Incorporating query term independence assumption for efficient retrieval and ranking using deep neural networks. arXiv:1907.03693 (2019).
[35]
Daniël Rennings, Felipe Moraes, and Claudia Hauff. 2019. An axiomatic approach to diagnosing neural IR models. In European Conference on Information Retrieval. Springer, 489--503.
[36]
T. Sakai and N. Kando. 2008. On information retrieval metrics designed for evaluation with incomplete relevance assessments. Information Retrieval, Vol. 11 (2008), 447--470.
[37]
Mark Sanderson, Monica Lestari Paramita, Paul Clough, and Evangelos Kanoulas. 2010. Do user preferences and evaluation measures line up?. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval. 555--562.
[38]
Paul Thomas and David Hawking. 2006. Evaluation by comparing result sets in context. In Proceedings of the 15th ACM international conference on Information and knowledge management. 94--101.
[39]
Christophe Van Gysel and Maarten de Rijke. 2018. Pytrec_eval: An Extremely Fast Python Interface to trec_eval. In SIGIR. ACM.
[40]
Sebastiano Vigna. 2015. A weighted correlation index for rankings with ties. In Proceedings of the 24th international conference on World Wide Web. 1166--1176.
[41]
E. Voorhees. 2019. The Evolution of Cranfield. In Information Retrieval Evaluation in a Changing World.
[42]
Ellen M. Voorhees. 2004. Overview of the TREC 2004 Robust Track. In Proceedings of the Thirteenth Text REtrieval Conference (TREC 2004). Gaithersburg, Maryland, 52--69.
[43]
William Webber, Alistair Moffat, and Justin Zobel. 2010. A similarity measure for indefinite rankings. ACM Transactions on Information Systems (TOIS), Vol. 28, 4 (2010), 1--38.
[44]
Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. 2017. End-to-End Neural Ad-Hoc Ranking with Kernel Pooling. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017). Tokyo, Japan, 55--64.
[45]
Andrew Yates, Kevin Martin Jose, Xinyu Zhang, and Jimmy Lin. 2020. Flexible IR pipelines with Capreolus. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 3181--3188.
[46]
Emine Yilmaz, Javed A Aslam, and Stephen Robertson. 2008. A new rank correlation coefficient for information retrieval. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval. 587--594.

Cited By

  • (2024) AnnoRank: A Comprehensive Web-Based Framework for Collecting Annotations and Assessing Rankings. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 5400-5404. https://doi.org/10.1145/3627673.3679174. Online publication date: 21-Oct-2024.
  • (2024) Resources for Combining Teaching and Research in Information Retrieval Coursework. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1115-1125. https://doi.org/10.1145/3626772.3657886. Online publication date: 10-Jul-2024.
  • (2023) The Information Retrieval Experiment Platform. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2826-2836. https://doi.org/10.1145/3539618.3591888. Online publication date: 19-Jul-2023.
  • (2023) Exploratory Visualization Tool for the Continuous Evaluation of Information Retrieval Systems. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 3220-3224. https://doi.org/10.1145/3539618.3591825. Online publication date: 19-Jul-2023.
  • (2022) Streamlining Evaluation with ir-measures. Advances in Information Retrieval, 305-310. https://doi.org/10.1007/978-3-030-99739-7_38. Online publication date: 10-Apr-2022.
  • (2021) Simplified Data Wrangling with ir_datasets. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2429-2436. https://doi.org/10.1145/3404835.3463254. Online publication date: 11-Jul-2021.


    Published In

    SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2021
    2998 pages
    ISBN:9781450380379
    DOI:10.1145/3404835

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. analysis
    2. evaluation
    3. information retrieval

    Qualifiers

    • Short-paper

    Conference

    SIGIR '21

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%
