A comparison of filtering evaluation metrics based on formal constraints

Published: 01 April 2019

Abstract

Although document filtering is simple to define, a wide range of evaluation measures has been proposed in the literature, all of which have been subject to criticism. Our goal is to compare metrics from a formal point of view, in order to understand whether, why and when each metric is appropriate, and to achieve a better understanding of the similarities and differences between metrics. Our formal study leads to a typology of measures for document filtering based on (1) a formal constraint that must be satisfied by any suitable evaluation measure, and (2) a set of three mutually exclusive formal properties that help to explain the fundamental differences between measures and to determine which ones are more appropriate for a given application scenario. As far as we know, this is the first in-depth study of how filtering metrics can be categorized according to their appropriateness for different scenarios. Two main findings derive from our study. First, not every measure satisfies the basic constraint, but problematic measures can be adapted using smoothing techniques that make them compliant with the basic constraint while preserving their original properties. Second, all metrics except one can be grouped into three families, each satisfying one of the three mutually exclusive formal properties. When the application scenario is clearly defined, this classification should help in choosing an adequate evaluation measure. The exception is the Reliability/Sensitivity metric pair, which does not fit into any of the three families but has two valuable empirical properties: it is strict (i.e. a good result according to Reliability/Sensitivity ensures a good result according to all other metrics) and it is more robust than all the other measures considered in our study.
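
As a concrete illustration of the first finding, the sketch below shows how Laplace-style smoothing can keep a measure well defined on degenerate filtering outputs. This is an illustrative assumption rather than the paper's actual construction: the smoothing constant alpha, the Confusion helper, and the choice of the F-measure are all hypothetical.

```python
# Minimal sketch (assumed, not the paper's formulation): filtering output
# summarized as a confusion matrix, with a Laplace-style constant `alpha`
# that keeps precision/recall (and hence F) defined when the filter
# accepts no documents at all.

from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # relevant documents the filter accepted
    fp: int  # non-relevant documents the filter accepted
    fn: int  # relevant documents the filter rejected
    tn: int  # non-relevant documents the filter rejected


def f_measure(c: Confusion, alpha: float = 0.0) -> float:
    """Balanced F-measure; alpha > 0 applies Laplace-style smoothing."""
    denom_p = c.tp + c.fp + 2 * alpha
    denom_r = c.tp + c.fn + 2 * alpha
    precision = (c.tp + alpha) / denom_p if denom_p > 0 else 0.0
    recall = (c.tp + alpha) / denom_r if denom_r > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A filter that rejects everything: unsmoothed precision is 0/0 (mapped
# to 0 by convention here), so F collapses to 0; the smoothed variant
# stays defined and strictly between 0 and 1.
empty_output = Confusion(tp=0, fp=0, fn=5, tn=95)
print(f_measure(empty_output))             # 0.0
print(f_measure(empty_output, alpha=0.5))  # ~0.143
```

Because alpha only shifts the counts by a constant, its effect vanishes as the counts grow, which is consistent with the abstract's remark that smoothing can repair a measure while preserving its original properties.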

Cited By

  • (2024) Properties of Group Fairness Measures for Rankings. ACM Transactions on Social Computing. https://doi.org/10.1145/3674883. Online publication date: 27-Aug-2024.
  • (2020) On the nature of information access evaluation metrics: a unifying framework. Information Retrieval, 23(3), 318–386. https://doi.org/10.1007/s10791-020-09374-0. Online publication date: 1-Jun-2020.

Published In

Information Retrieval, Volume 22, Issue 6
Dec 2019
95 pages

Publisher

Kluwer Academic Publishers
United States

Publication History

Published: 01 April 2019
Accepted: 20 March 2019
Received: 14 April 2016

Author Tags

1. Document filtering
2. Evaluation metrics
3. Evaluation methodologies

Qualifiers

• Research-article
