DOI: 10.1145/3664190.3672532
Research article · Open access

Normalised Precision at Fixed Recall for Evaluating TAR

Published: 05 August 2024

Abstract

A popular approach to High-Recall Information Retrieval (HRIR) is Technology-Assisted Review (TAR), which uses information retrieval and machine learning techniques to aid the review of large document collections. TAR systems are commonly used in legal eDiscovery and in medical systematic literature reviews. Successful TAR systems find the majority of relevant documents with the fewest manual assessments. Previous work typically evaluated TAR models retrospectively, assuming that the system first achieves a specific, fixed Recall level and then measuring model quality (for instance, work saved at r% Recall).
This paper presents an analysis of one such measure: Precision at r% Recall (P@r%). We show that the minimum P@r% score depends on the dataset, and therefore this measure should not be used for evaluation across topics or datasets. We propose its min-max normalised version (nP@r%) and show that it is equal to the product of the TNR and Precision scores. Our analysis shows that nP@r% is the measure least correlated with the percentage of relevant documents in the dataset and can be used to focus on additional aspects of TAR tasks that are not captured by current measures. Finally, we introduce a variation of nP@r% that is the geometric mean of TNR and Precision, preserving the properties of nP@r% while having a lower coefficient of variation.
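The measures described above can be sketched numerically. The snippet below is a minimal illustration, not the authors' implementation: it screens a ranked list until the target recall is reached, then computes Precision and TNR at that cutoff, the product form of nP@r% stated in the abstract, and the geometric-mean variant. The function names and the toy ranking are our own assumptions for illustration.

```python
import math

def confusion_at_recall(ranking, relevant, target_recall=0.95):
    """Screen documents in ranked order until target recall is reached,
    then return the confusion-matrix counts (tp, fp, tn, fn) at that cutoff.
    `ranking` is a list of document ids; `relevant` is the set of relevant ids."""
    total_rel = sum(1 for d in ranking if d in relevant)
    needed = math.ceil(target_recall * total_rel)  # relevant docs required for r% recall
    tp = fp = 0
    for doc in ranking:
        if doc in relevant:
            tp += 1
        else:
            fp += 1
        if tp >= needed:
            break
    fn = total_rel - tp
    tn = len(ranking) - tp - fp - fn
    return tp, fp, tn, fn

def np_at_r(tp, fp, tn, fn):
    """nP@r% via the product form claimed in the abstract: TNR * Precision."""
    precision = tp / (tp + fp)
    tnr = tn / (tn + fp)
    return tnr * precision

def gm_np_at_r(tp, fp, tn, fn):
    """Geometric-mean variant: sqrt(TNR * Precision)."""
    precision = tp / (tp + fp)
    tnr = tn / (tn + fp)
    return math.sqrt(tnr * precision)
```

For example, with 10 documents of which 4 are relevant, reaching 95% recall may require screening 5 documents (4 relevant, 1 not), giving Precision = 4/5 and TNR = 5/6, so nP@r% = (5/6) * 0.8.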



Published In

ICTIR '24: Proceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval
August 2024, 267 pages
ISBN: 9798400706813
DOI: 10.1145/3664190
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. citation screening
  2. evaluation
  3. high-recall retrieval
  4. normalised precision
  5. precision at recall
  6. systematic reviews
  7. tar


Conference

ICTIR '24
Paper Acceptance Rate: 26 of 45 submissions, 58%
Overall Acceptance Rate: 235 of 527 submissions, 45%

