A comparison of filtering evaluation metrics based on formal constraints

Published: 01 April 2019

Abstract

Although document filtering is simple to define, a wide range of evaluation measures has been proposed in the literature, all of which have been subject to criticism. Our goal is to compare metrics from a formal point of view, in order to understand whether, why and when each metric is appropriate, and to achieve a better understanding of the similarities and differences between metrics. Our formal study leads to a typology of measures for document filtering based on (1) a formal constraint that must be satisfied by any suitable evaluation measure, and (2) a set of three mutually exclusive formal properties that help to explain the fundamental differences between measures and to determine which ones are more appropriate for a given application scenario. As far as we know, this is the first in-depth study of how filtering metrics can be categorized according to their appropriateness for different scenarios. Two main findings derive from our study. First, not every measure satisfies the basic constraint, but problematic measures can be adapted using smoothing techniques that make them compliant with the basic constraint while preserving their original properties. Second, all metrics except one can be grouped into three families, each satisfying one of the three mutually exclusive formal properties. When the application scenario is clearly defined, this classification should help in choosing an adequate evaluation measure. The exception is the Reliability/Sensitivity metric pair, which does not fit into any of the three families but has two valuable empirical properties: it is strict (i.e. a good result according to Reliability/Sensitivity ensures a good result according to all other metrics) and it is more robust than all the other measures considered in our study.
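
As a concrete illustration of the first finding, the sketch below shows how Laplace-style smoothing can keep a measure well defined on degenerate filtering outputs. This is an illustrative assumption rather than the paper's actual construction: the smoothing constant alpha, the Confusion helper, and the choice of the F-measure are all hypothetical.

```python
# Minimal sketch (assumed, not the paper's formulation): filtering output
# summarized as a confusion matrix, with a Laplace-style constant `alpha`
# that keeps precision/recall (and hence F) defined when the filter
# accepts no documents at all.

from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # relevant documents the filter accepted
    fp: int  # non-relevant documents the filter accepted
    fn: int  # relevant documents the filter rejected
    tn: int  # non-relevant documents the filter rejected


def f_measure(c: Confusion, alpha: float = 0.0) -> float:
    """Balanced F-measure; alpha > 0 applies Laplace-style smoothing."""
    denom_p = c.tp + c.fp + 2 * alpha
    denom_r = c.tp + c.fn + 2 * alpha
    precision = (c.tp + alpha) / denom_p if denom_p > 0 else 0.0
    recall = (c.tp + alpha) / denom_r if denom_r > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A filter that rejects everything: unsmoothed precision is 0/0 (mapped
# to 0 by convention here), so F collapses to 0; the smoothed variant
# stays defined and strictly between 0 and 1.
empty_output = Confusion(tp=0, fp=0, fn=5, tn=95)
print(f_measure(empty_output))             # 0.0
print(f_measure(empty_output, alpha=0.5))  # ~0.143
```

Because alpha only shifts the counts by a constant, its effect vanishes as the counts grow, which is consistent with the abstract's remark that smoothing can repair a measure while preserving its original properties.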

Cited By

  • (2024) Properties of Group Fairness Measures for Rankings. ACM Transactions on Social Computing. https://doi.org/10.1145/3674883. Online publication date: 27-Aug-2024.
  • (2020) On the nature of information access evaluation metrics: a unifying framework. Information Retrieval, 23(3), 318–386. https://doi.org/10.1007/s10791-020-09374-0. Online publication date: 1-Jun-2020.

Published In

Information Retrieval, Volume 22, Issue 6
Dec 2019
95 pages

Publisher

Kluwer Academic Publishers
United States

Publication History

Published: 01 April 2019
Accepted: 20 March 2019
Received: 14 April 2016

Author Tags

1. Document filtering
2. Evaluation metrics
3. Evaluation methodologies

Qualifiers

• Research-article
