Keynote

Proof by Experimentation? Towards Better IR Research

Published: 25 July 2020
    Abstract

    The current fight against the COVID-19 pandemic illustrates the importance of proper scientific methods: Besides fake news lacking any factual evidence, reports on clinical trials with various drugs often yield contradictory results; here, only a closer look at the underlying empirical methodology can help in forming a clearer picture.
    In IR research, empirical foundation in the form of experiments plays an important role. However, the methods applied are often not at the level of scientific standards that hold in many other disciplines, as IR experiments are frequently flawed in several ways: Measures like MRR or ERR are invalid by definition, and MAP is based on unrealistic assumptions about user behaviour; computing relative improvements of arithmetic means is statistical nonsense; test hypotheses are often formulated after the experiment has been carried out; multiple hypotheses are tested without correction; many experimental results are not reproducible or are compared to weak baselines [1, 6]; frequent reuse of the same test collections yields random results [2]; authors (and reviewers) believe that experiments prove the claims made. Methods for overcoming these problems have been pointed out [5], but are still widely ignored.
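
    To make the multiple-testing pitfall above concrete, the following sketch (illustrative only, not taken from the keynote) applies a Holm-Bonferroni correction when several systems are each compared against the same baseline with paired t-tests; all per-topic scores and system names are synthetic placeholders.

        # Illustrative sketch only: Holm-Bonferroni correction for multiple
        # paired significance tests, one of the pitfalls named in the abstract.
        # The per-topic scores below are synthetic placeholders, not real runs.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(42)
        n_topics = 50

        # Synthetic per-topic effectiveness scores for a baseline and three systems.
        baseline = rng.beta(2, 5, n_topics)
        systems = {
            f"system_{i}": np.clip(baseline + rng.normal(0.01 * i, 0.05, n_topics), 0, 1)
            for i in range(1, 4)
        }

        # One paired t-test per system against the same baseline -> multiple hypotheses.
        raw = {name: stats.ttest_rel(scores, baseline).pvalue
               for name, scores in systems.items()}

        # Holm-Bonferroni: sort p-values ascending, compare the k-th smallest to alpha / (m - k).
        alpha, m = 0.05, len(raw)
        ordered = sorted(raw.items(), key=lambda kv: kv[1])
        rejected = set()
        for k, (name, p) in enumerate(ordered):
            if p <= alpha / (m - k):
                rejected.add(name)
            else:
                break  # once one test fails, all larger p-values fail as well

        for name, p in raw.items():
            verdict = "significant" if name in rejected else "not significant"
            print(f"{name}: raw p = {p:.4f} -> {verdict} after Holm correction")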

    Supplementary Material

    MP4 File (3397271.3402426.mp4)
    In IR research, empirical foundation in the form of experiments plays an important role. However, the methods applied are often not at the level of scientific standards that hold in many other disciplines, as IR experiments are frequently flawed in several ways, such as using invalid measures, inappropriate statistical methods, or comparisons to weak baselines. Moreover, neither re-used test collections nor leaderboards can yield statistically valid results; only evaluation campaigns or results reproduced on new collections will lead to scientifically valid findings. In order to make statements that do not only hold for the few tested collections, IR research should focus more on external validity, by aiming at understanding why certain methods work (or don't work) under certain circumstances. Ultimately, research should aim more at performance prediction than at performance measurement.
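
    As a hypothetical illustration of the closing point about performance prediction versus performance measurement (not material from the talk), the sketch below fits a simple least-squares model that predicts per-topic average precision from invented query features and checks it on held-out topics, rather than only reporting a collection-level mean; all data and feature names are placeholders.

        # Hypothetical illustration of performance prediction vs. measurement:
        # fit a simple model that predicts per-topic average precision from
        # query features, then check how well it generalises to held-out topics.
        # All data and feature names below are invented placeholders.
        import numpy as np

        rng = np.random.default_rng(0)
        n_topics = 200

        # Invented query features (e.g. query length, idf spread, ...).
        features = rng.normal(size=(n_topics, 3))
        true_weights = np.array([0.08, -0.05, 0.03])
        ap = np.clip(0.3 + features @ true_weights + rng.normal(0, 0.05, n_topics), 0, 1)

        # Split topics into training and held-out sets.
        train, test = np.arange(0, 150), np.arange(150, 200)

        # Ordinary least squares on the training topics (with an intercept column).
        X_train = np.column_stack([np.ones(len(train)), features[train]])
        coef, *_ = np.linalg.lstsq(X_train, ap[train], rcond=None)

        # Predicting held-out effectiveness is a (toy) external-validity check,
        # whereas reporting only the training mean would be pure measurement.
        X_test = np.column_stack([np.ones(len(test)), features[test]])
        pred = X_test @ coef
        print(f"mean AP on held-out topics:   {ap[test].mean():.3f}")
        print(f"mean predicted AP (held-out): {pred.mean():.3f}")
        print(f"RMSE of per-topic prediction: {np.sqrt(np.mean((pred - ap[test])**2)):.3f}")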

    References

    [1] Timothy G. Armstrong, Alistair Moffat, William Webber, and Justin Zobel. 2009. Improvements that don't add up: ad-hoc retrieval results since 1998. In CIKM, David Wai-Lok Cheung, Il-Yeol Song, Wesley W. Chu, Xiaohua Hu, and Jimmy J. Lin (Eds.). ACM, 601-610.
    [2] Ben Carterette. 2015. The Best Published Result is Random: Sequential Testing and Its Effect on Reported Effectiveness. In Proc. SIGIR (SIGIR '15). ACM, New York, NY, USA, 747-750. https://doi.org/10.1145/2766462.2767812
    [3] Nicola Ferro et al. 2018. The Dagstuhl Perspectives Workshop on Performance Modeling and Prediction. SIGIR Forum, Vol. 52, 1 (2018), 91-101.
    [4] Norbert Fuhr. 2012. Salton Award Lecture: Information Retrieval As Engineering Science. SIGIR Forum, Vol. 46, 2 (Dec. 2012), 19-28. https://doi.org/10.1145/2422256.2422259
    [5] Norbert Fuhr. 2017. Some Common Mistakes In IR Evaluation, And How They Can Be Avoided. SIGIR Forum, Vol. 51, 3 (2017), 32-41. http://sigir.org/wp-content/uploads/2018/01/p032.pdf
    [6] Jimmy Lin. 2019. The Neural Hype and Comparisons Against Weak Baselines. SIGIR Forum, Vol. 52, 2 (Jan. 2019), 40-51. https://doi.org/10.1145/3308774.3308781

      Published In

      SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval
      July 2020
      2548 pages
      ISBN:9781450380164
      DOI:10.1145/3397271

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. evaluation methodology
      2. performance prediction

      Qualifiers

      • Keynote

      Conference

      SIGIR '20

      Acceptance Rates

      Overall Acceptance Rate 792 of 3,983 submissions, 20%
