DOI: 10.1145/3397271.3401036
Research article

How to Measure the Reproducibility of System-oriented IR Experiments

Published: 25 July 2020

Abstract

Replicability and reproducibility of experimental results are primary concerns in all areas of science, and IR is no exception. Besides the problem of moving the field towards more reproducible experimental practices and protocols, we also face a severe methodological issue: we do not have any means to assess when a result actually counts as reproduced. Moreover, we lack any reproducibility-oriented dataset that would allow us to develop such methods.
To address these issues, we compare several measures that objectively quantify to what extent we have replicated or reproduced a system-oriented IR experiment. These measures operate at different levels of granularity, from the fine-grained comparison of ranked lists to the more general comparison of the obtained effects and significant differences. Moreover, we develop a reproducibility-oriented dataset, which allows us to validate our measures and which can also be used to develop future measures.
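As a rough illustration of what such measures can look like at the two levels of granularity mentioned above, the sketch below compares a reproduced run against an original run: first by rank correlation over the ranked lists for a single topic, then by the spread between per-topic effectiveness scores, and finally by the ratio of the reproduced to the original improvement over a baseline. This is a minimal Python example with made-up run data and scores; the particular measures shown (Kendall's tau, RMSE, an effect ratio) are illustrative stand-ins, not the paper's implementation or dataset.

    # Minimal sketch of replicability/reproducibility measures at two
    # levels of granularity; all rankings and scores are illustrative.
    import numpy as np
    from scipy.stats import kendalltau

    # Fine-grained level: ranked document lists for one topic,
    # original run vs. reproduced run.
    original_ranking = ["d3", "d1", "d7", "d2", "d9"]
    reproduced_ranking = ["d3", "d7", "d1", "d2", "d5"]

    # Rank correlation over the documents retrieved by both runs.
    shared = [d for d in original_ranking if d in reproduced_ranking]
    tau, _ = kendalltau([original_ranking.index(d) for d in shared],
                        [reproduced_ranking.index(d) for d in shared])
    print(f"Kendall's tau on shared documents: {tau:.3f}")

    # Coarser level: per-topic effectiveness scores (e.g. average precision)
    # of the original and the reproduced run on the same topics.
    ap_original = np.array([0.42, 0.31, 0.55, 0.10, 0.67])
    ap_reproduced = np.array([0.40, 0.33, 0.50, 0.12, 0.60])
    rmse = np.sqrt(np.mean((ap_original - ap_reproduced) ** 2))
    print(f"RMSE of per-topic scores: {rmse:.3f}")

    # Effect level: does the improvement of an advanced run over a baseline
    # carry over from the original to the reproduced experiment?
    base_orig = np.array([0.30, 0.25, 0.40])
    adv_orig = np.array([0.38, 0.30, 0.45])
    base_repr = np.array([0.29, 0.26, 0.41])
    adv_repr = np.array([0.36, 0.31, 0.44])
    effect_ratio = np.mean(adv_repr - base_repr) / np.mean(adv_orig - base_orig)
    print(f"Ratio of reproduced to original mean improvement: {effect_ratio:.3f}")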

Supplementary Material

MP4 File (3397271.3401036.mp4)




    Information

    Published In

    SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2020
    2548 pages
    ISBN: 9781450380164
    DOI: 10.1145/3397271


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 25 July 2020


    Author Tags

    1. measure
    2. replicability
    3. reproducibility

    Qualifiers

    • Research-article

    Funding Sources

    • German Research Foundation (DFG)
    • Innovationsfonden Denmark

    Conference

    SIGIR '20

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%


    Article Metrics

    • Downloads (Last 12 months): 37
    • Downloads (Last 6 weeks): 4
    Reflects downloads up to 13 Jan 2025

    Cited By
    • (2024) Evaluation of Temporal Change in IR Test Collections. Proceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval, 3-13. DOI: 10.1145/3664190.3672530. Online publication date: 2-Aug-2024.
    • (2024) A Reproducibility Study of PLAID. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1411-1419. DOI: 10.1145/3626772.3657856. Online publication date: 10-Jul-2024.
    • (2024) Replicability Measures for Longitudinal Information Retrieval Evaluation. Experimental IR Meets Multilinguality, Multimodality, and Interaction, 215-226. DOI: 10.1007/978-3-031-71736-9_16. Online publication date: 14-Sep-2024.
    • (2023) Validating Synthetic Usage Data in Living Lab Environments. Journal of Data and Information Quality. DOI: 10.1145/3623640. Online publication date: 24-Sep-2023.
    • (2023) A Next Basket Recommendation Reality Check. ACM Transactions on Information Systems, 41(4), 1-29. DOI: 10.1145/3587153. Online publication date: 21-Apr-2023.
    • (2023) Trustworthy AI: From Principles to Practices. ACM Computing Surveys, 55(9), 1-46. DOI: 10.1145/3555803. Online publication date: 16-Jan-2023.
    • (2023) The Information Retrieval Experiment Platform. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2826-2836. DOI: 10.1145/3539618.3591888. Online publication date: 19-Jul-2023.
    • (2023) The Best is Yet to Come: A Reproducible Analysis of CLEF eHealth TAR Experiments. Experimental IR Meets Multilinguality, Multimodality, and Interaction, 15-20. DOI: 10.1007/978-3-031-42448-9_2. Online publication date: 18-Sep-2023.
    • (2022) Towards Reproducible Machine Learning Research in Information Retrieval. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 3459-3461. DOI: 10.1145/3477495.3532686. Online publication date: 6-Jul-2022.
    • (2022) Offline Evaluation of Ranked Lists using Parametric Estimation of Propensities. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 622-632. DOI: 10.1145/3477495.3532032. Online publication date: 6-Jul-2022.
