How to design the fair experimental classifier evaluation

Published: 01 June 2021

Abstract

Many researchers working on classification problems evaluate the quality of developed algorithms based on computer experiments. The conclusions drawn from them are usually supported by statistical analysis and a chosen experimental protocol. Statistical tests are widely used to confirm whether the considered methods significantly outperform reference classifiers. Usually, the tests are applied to stratified datasets, which raises the question of whether the data folds used for classification are really drawn at random and whether the statistical analysis supports robust conclusions. Unfortunately, some scientists do not realize the real meaning of the obtained results and overinterpret them. They do not see that inappropriate use of such analytical tools may lead them into a trap. This paper aims to show the weaknesses of commonly used experimental protocols and to discuss whether we can really trust such an evaluation methodology, whether all presented evaluations are fair, and whether it is possible to manipulate experimental results using well-known statistical evaluation methods. We show that it is possible to select only those results which confirm the experimenter's expectations, and we discuss what can be done to avoid such likely unethical behavior. At the end of this work, we formulate recommendations on how to improve an experimental protocol in order to design a fair experimental classifier evaluation.
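
To make the manipulation risk concrete, the sketch below (illustrative only, not code from the paper; the dataset, the two classifiers and the 30-seed range are arbitrary assumptions) re-draws stratified 10-fold splits under different random seeds and runs a paired Wilcoxon test on the per-fold accuracies each time. Reporting only the most favorable seed is exactly the kind of result selection the abstract warns against.

```python
# Illustrative sketch only (not the paper's code): "seed shopping" for a
# favourable paired test. Dataset, classifiers and the 30-seed range are
# arbitrary assumptions made for this example.
from scipy.stats import wilcoxon
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf_a, clf_b = GaussianNB(), DecisionTreeClassifier(random_state=0)

results = []
for seed in range(30):  # re-draw the stratified folds 30 times
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    acc_a = cross_val_score(clf_a, X, y, cv=cv, scoring="accuracy")
    acc_b = cross_val_score(clf_b, X, y, cv=cv, scoring="accuracy")
    _, p = wilcoxon(acc_a, acc_b)  # paired test over the 10 per-fold accuracies
    results.append((seed, p))

# Reporting only the seed with the smallest p-value is precisely the kind of
# result selection the paper warns against.
p_values = sorted(p for _, p in results)
print(f"smallest p-value across seeds: {p_values[0]:.4f}")
print(f"largest  p-value across seeds: {p_values[-1]:.4f}")
```

In a fair protocol the fold-generation seed (or the full set of seeds) would be fixed and reported before any results are inspected, and all runs, not only the favorable one, would be published.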

Highlights

Presenting the weaknesses of commonly used experimental protocols.
Discussing whether all reported evaluations are always fair.
Demonstrating how experimental results can be manipulated using well-known statistical evaluation methods.
Showing that it is possible to select only those results which confirm the experimenter's expectations.
Recommending how to design a fair experimental classifier evaluation that avoids likely unethical behavior (see the protocol sketch below).
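
The sketch below outlines the recommended direction under illustrative assumptions (the three classifiers, three toy datasets, the 5x2 repeated stratified cross-validation and SEED=1234 are choices made for the example, not the paper's prescribed configuration): declare one fold-generation seed up front, aggregate one score per classifier per dataset, and apply a rank-based omnibus test such as the Friedman test before making any pairwise claims.

```python
# Sketch of a fairer multi-classifier, multi-dataset protocol; all concrete
# choices here (classifiers, datasets, CV scheme, SEED) are illustrative.
import numpy as np
from scipy.stats import friedmanchisquare
from sklearn.datasets import load_breast_cancer, load_digits, load_wine
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

SEED = 1234  # declared and reported once, before any result is inspected

classifiers = {
    "NB": GaussianNB(),
    "kNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(random_state=SEED),
}
datasets = {"wine": load_wine, "cancer": load_breast_cancer, "digits": load_digits}

# One mean score per (dataset, classifier) pair, all from the same pre-declared folds.
cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=SEED)
scores = np.array([
    [cross_val_score(clf, *loader(return_X_y=True), cv=cv).mean()
     for clf in classifiers.values()]
    for loader in datasets.values()
])

# Rank-based omnibus test over datasets: one column of scores per classifier.
stat, p = friedmanchisquare(*scores.T)
print(np.round(scores, 3))
print(f"Friedman statistic={stat:.2f}, p={p:.4f}")
```

With only three datasets the chi-square approximation is of course crude; in practice many more datasets are used, and a significant omnibus result is followed by post-hoc pairwise comparisons with a multiple-comparison correction, again with all runs reported.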

Published In

Applied Soft Computing, Volume 104, Issue C, June 2021, 774 pages

Publisher

Elsevier Science Publishers B. V., Netherlands

Author Tags

  1. Statistical tests
  2. Classifier evaluation
  3. Credibility of model evaluation
  4. Experimental protocol

Qualifiers

  • Research-article

