Keynote

Proof by Experimentation? Towards Better IR Research

Published: 25 July 2020
    Abstract

    The current fight against the COVID-19 pandemic illustrates the importance of proper scientific methods: Besides fake news lacking any factual evidence, reports on clinical trials with various drugs often yield contradictory results; here, only a closer look at the underlying empirical methodology can help in forming a clearer picture.
    In IR research, empirical foundation in the form of experiments plays an important role. However, the methods applied are often not at the level of scientific standards that hold in many other disciplines, as IR experiments are frequently flawed in several ways: Measures like MRR or ERR are invalid by definition, and MAP is based on unrealistic assumptions about user behaviour; computing relative improvements of arithmetic means is statistical nonsense; test hypotheses are often formulated after the experiment has been carried out; multiple hypotheses are tested without correction; many experimental results are not reproducible or are compared to weak baselines [1, 6]; frequent reuse of the same test collections yields random results [2]; authors (and reviewers) believe that experiments prove the claims made. Methods for overcoming these problems have been pointed out [5], but are still widely ignored.
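
    To make the multiple-testing pitfall above concrete, the following sketch (illustrative only, not taken from the keynote) applies a Holm-Bonferroni correction when several systems are each compared against the same baseline with paired t-tests; all per-topic scores and system names are synthetic placeholders.

        # Illustrative sketch only: Holm-Bonferroni correction for multiple
        # paired significance tests, one of the pitfalls named in the abstract.
        # The per-topic scores below are synthetic placeholders, not real runs.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(42)
        n_topics = 50

        # Synthetic per-topic effectiveness scores for a baseline and three systems.
        baseline = rng.beta(2, 5, n_topics)
        systems = {
            f"system_{i}": np.clip(baseline + rng.normal(0.01 * i, 0.05, n_topics), 0, 1)
            for i in range(1, 4)
        }

        # One paired t-test per system against the same baseline -> multiple hypotheses.
        raw = {name: stats.ttest_rel(scores, baseline).pvalue
               for name, scores in systems.items()}

        # Holm-Bonferroni: sort p-values ascending, compare the k-th smallest to alpha / (m - k).
        alpha, m = 0.05, len(raw)
        ordered = sorted(raw.items(), key=lambda kv: kv[1])
        rejected = set()
        for k, (name, p) in enumerate(ordered):
            if p <= alpha / (m - k):
                rejected.add(name)
            else:
                break  # once one test fails, all larger p-values fail as well

        for name, p in raw.items():
            verdict = "significant" if name in rejected else "not significant"
            print(f"{name}: raw p = {p:.4f} -> {verdict} after Holm correction")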

    Supplementary Material

    MP4 File (3397271.3402426.mp4)
    In IR research, empirical foundation in the form of experiments plays an important role. However, the methods applied are often not at the level of scientific standards that hold in many other disciplines, as IR experiments are frequently flawed in several ways, such as using invalid measures, inappropriate statistical methods, or comparisons to weak baselines. Moreover, neither re-used test collections nor leaderboards can yield statistically valid results; only evaluation campaigns or results reproduced on new collections will lead to scientifically valid findings. In order to make statements that do not only hold for the few tested collections, IR research should focus more on external validity, by aiming at understanding why certain methods work (or don't work) under certain circumstances. Ultimately, research should aim more at performance prediction than at performance measurement.
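
    As a hypothetical illustration of the closing point about performance prediction versus performance measurement (not material from the talk), the sketch below fits a simple least-squares model that predicts per-topic average precision from invented query features and checks it on held-out topics, rather than only reporting a collection-level mean; all data and feature names are placeholders.

        # Hypothetical illustration of performance prediction vs. measurement:
        # fit a simple model that predicts per-topic average precision from
        # query features, then check how well it generalises to held-out topics.
        # All data and feature names below are invented placeholders.
        import numpy as np

        rng = np.random.default_rng(0)
        n_topics = 200

        # Invented query features (e.g. query length, idf spread, ...).
        features = rng.normal(size=(n_topics, 3))
        true_weights = np.array([0.08, -0.05, 0.03])
        ap = np.clip(0.3 + features @ true_weights + rng.normal(0, 0.05, n_topics), 0, 1)

        # Split topics into training and held-out sets.
        train, test = np.arange(0, 150), np.arange(150, 200)

        # Ordinary least squares on the training topics (with an intercept column).
        X_train = np.column_stack([np.ones(len(train)), features[train]])
        coef, *_ = np.linalg.lstsq(X_train, ap[train], rcond=None)

        # Predicting held-out effectiveness is a (toy) external-validity check,
        # whereas reporting only the training mean would be pure measurement.
        X_test = np.column_stack([np.ones(len(test)), features[test]])
        pred = X_test @ coef
        print(f"mean AP on held-out topics:   {ap[test].mean():.3f}")
        print(f"mean predicted AP (held-out): {pred.mean():.3f}")
        print(f"RMSE of per-topic prediction: {np.sqrt(np.mean((pred - ap[test])**2)):.3f}")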

    References

    [1] Timothy G. Armstrong, Alistair Moffat, William Webber, and Justin Zobel. 2009. Improvements that don't add up: ad-hoc retrieval results since 1998. In CIKM, David Wai-Lok Cheung, Il-Yeol Song, Wesley W. Chu, Xiaohua Hu, and Jimmy J. Lin (Eds.). ACM, 601-610.
    [2] Ben Carterette. 2015. The Best Published Result is Random: Sequential Testing and Its Effect on Reported Effectiveness. In Proc. SIGIR (SIGIR '15). ACM, New York, NY, USA, 747-750. https://doi.org/10.1145/2766462.2767812
    [3] Nicola Ferro et al. 2018. The Dagstuhl Perspectives Workshop on Performance Modeling and Prediction. SIGIR Forum, Vol. 52, 1 (2018), 91-101.
    [4] Norbert Fuhr. 2012. Salton Award Lecture: Information Retrieval As Engineering Science. SIGIR Forum, Vol. 46, 2 (Dec. 2012), 19-28. https://doi.org/10.1145/2422256.2422259
    [5] Norbert Fuhr. 2017. Some Common Mistakes In IR Evaluation, And How They Can Be Avoided. SIGIR Forum, Vol. 51, 3 (2017), 32-41. http://sigir.org/wp-content/uploads/2018/01/p032.pdf
    [6] Jimmy Lin. 2019. The Neural Hype and Comparisons Against Weak Baselines. SIGIR Forum, Vol. 52, 2 (Jan. 2019), 40-51. https://doi.org/10.1145/3308774.3308781

      Published In

      SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval
      July 2020
      2548 pages
      ISBN:9781450380164
      DOI:10.1145/3397271

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. evaluation methodology
      2. performance prediction

      Qualifiers

      • Keynote

      Conference

      SIGIR '20

      Acceptance Rates

      Overall Acceptance Rate 792 of 3,983 submissions, 20%
