research-article

Approximate Recall Confidence Intervals

Published: 01 January 2013
    Abstract

    Recall, the proportion of relevant documents retrieved, is an important measure of effectiveness in information retrieval, particularly in the legal, patent, and medical domains. Where document sets are too large for exhaustive relevance assessment, recall can be estimated by assessing a random sample of documents, but an indication of the reliability of this estimate is also required. In this article, we examine several methods for estimating two-tailed recall confidence intervals. We find that the normal approximation in current use provides poor coverage in many circumstances, even when adjusted to correct its inappropriate symmetry. Analytic and Bayesian methods based on the ratio of binomials are generally more accurate but are inaccurate on small populations. The method we recommend derives beta-binomial posteriors on retrieved and unretrieved yield, with fixed hyperparameters, and a Monte Carlo estimate of the posterior distribution of recall. We demonstrate that this method gives mean coverage at or near the nominal level, across several scenarios, while being balanced and stable. We offer advice on sampling design, including the allocation of assessments to the retrieved and unretrieved segments, and compare the proposed beta-binomial with the officially reported normal intervals for recent TREC Legal Track iterations.
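    The recommended approach can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a Beta(0.5, 0.5) prior on the prevalence of relevant documents in each segment (the abstract says only that the hyperparameters are fixed, so these values are an assumption), simple random sampling without replacement within each segment, and an equal-tailed percentile interval. The names draw_yield and recall_interval are hypothetical.

```python
import random


def draw_yield(rng, population, sampled, relevant, alpha, beta):
    """Draw one posterior value of a segment's yield (its count of relevant documents)."""
    # Beta posterior on the segment's prevalence, given the sample outcome.
    p = rng.betavariate(alpha + relevant, beta + sampled - relevant)
    # Binomial draw over the unsampled documents; combined with the beta draw
    # above, the unseen count follows a beta-binomial distribution.
    unseen = sum(rng.random() < p for _ in range(population - sampled))
    return relevant + unseen


def recall_interval(retrieved, unretrieved, alpha=0.5, beta=0.5,
                    draws=2000, level=0.95, seed=0):
    """Monte Carlo interval on recall.

    retrieved and unretrieved are (population, sampled, relevant) triples.
    alpha and beta are assumed prior hyperparameters, not values from the paper.
    """
    rng = random.Random(seed)
    recalls = []
    for _ in range(draws):
        r = draw_yield(rng, *retrieved, alpha, beta)
        u = draw_yield(rng, *unretrieved, alpha, beta)
        # Recall is undefined when no relevant documents exist; this sketch
        # records 0.0 in that case as an arbitrary convention.
        recalls.append(r / (r + u) if r + u > 0 else 0.0)
    recalls.sort()
    tail = (1.0 - level) / 2.0
    lower = recalls[int(tail * draws)]
    upper = recalls[min(draws - 1, int((1.0 - tail) * draws))]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical data: 200 retrieved documents, 50 sampled, 30 relevant;
    # 800 unretrieved documents, 50 sampled, 2 relevant.
    print(recall_interval((200, 50, 30), (800, 50, 2)))
```

    Because each draw simulates the unsampled documents one at a time, the sketch is practical only for small segments; a vectorized binomial sampler would be the natural substitute for larger collections.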

    Published In

    ACM Transactions on Information Systems  Volume 31, Issue 1
    January 2013
    163 pages
    ISSN:1046-8188
    EISSN:1558-2868
    DOI:10.1145/2414782

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 January 2013
    Accepted: 01 October 2012
    Revised: 01 August 2012
    Received: 01 May 2012
    Published in TOIS Volume 31, Issue 1

    Author Tags

    1. Posterior distributions
    2. Probabilistic models

    Qualifiers

    • Research-article
    • Research
    • Refereed
