How to design the fair experimental classifier evaluation

Published: 01 June 2021

Abstract

Many researchers working on classification problems evaluate the quality of developed algorithms based on computer experiments. The conclusions drawn from them are usually supported by statistical analysis and a chosen experimental protocol. Statistical tests are widely used to confirm whether the considered methods significantly outperform reference classifiers. Usually, the tests are applied to stratified datasets, which raises the question of whether the data folds used for classification are really drawn at random and whether the statistical analysis supports robust conclusions. Unfortunately, some scientists do not realize the real meaning of the obtained results and overinterpret them. They do not see that inappropriate use of such analytical tools may lead them into a trap. This paper aims to show the weaknesses of commonly used experimental protocols and to discuss whether we can really trust such an evaluation methodology, whether all presented evaluations are fair, and whether it is possible to manipulate experimental results using well-known statistical evaluation methods. We show that it is possible to select only those results which confirm the experimenter's expectations, and we discuss what can be done to avoid such likely unethical behavior. At the end of this work, we formulate recommendations on how to improve an experimental protocol in order to design a fair experimental classifier evaluation.
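
To make the manipulation risk concrete, the sketch below (illustrative only, not code from the paper; the dataset, the two classifiers and the 30-seed range are arbitrary assumptions) re-draws stratified 10-fold splits under different random seeds and runs a paired Wilcoxon test on the per-fold accuracies each time. Reporting only the most favorable seed is exactly the kind of result selection the abstract warns against.

```python
# Illustrative sketch only (not the paper's code): "seed shopping" for a
# favourable paired test. Dataset, classifiers and the 30-seed range are
# arbitrary assumptions made for this example.
from scipy.stats import wilcoxon
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf_a, clf_b = GaussianNB(), DecisionTreeClassifier(random_state=0)

results = []
for seed in range(30):  # re-draw the stratified folds 30 times
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    acc_a = cross_val_score(clf_a, X, y, cv=cv, scoring="accuracy")
    acc_b = cross_val_score(clf_b, X, y, cv=cv, scoring="accuracy")
    _, p = wilcoxon(acc_a, acc_b)  # paired test over the 10 per-fold accuracies
    results.append((seed, p))

# Reporting only the seed with the smallest p-value is precisely the kind of
# result selection the paper warns against.
p_values = sorted(p for _, p in results)
print(f"smallest p-value across seeds: {p_values[0]:.4f}")
print(f"largest  p-value across seeds: {p_values[-1]:.4f}")
```

In a fair protocol the fold-generation seed (or the full set of seeds) would be fixed and reported before any results are inspected, and all runs, not only the favorable one, would be published.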

Highlights

Presenting the weaknesses of commonly used experimental protocols.
Discussing whether all reported evaluations are always fair.
Demonstrating how experimental results can be manipulated using well-known statistical evaluation methods.
Showing that it is possible to select only those results which confirm the experimenter's expectations.
Recommending how to design a fair experimental classifier evaluation that avoids likely unethical behavior (see the protocol sketch below).
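
The sketch below outlines the recommended direction under illustrative assumptions (the three classifiers, three toy datasets, the 5x2 repeated stratified cross-validation and SEED=1234 are choices made for the example, not the paper's prescribed configuration): declare one fold-generation seed up front, aggregate one score per classifier per dataset, and apply a rank-based omnibus test such as the Friedman test before making any pairwise claims.

```python
# Sketch of a fairer multi-classifier, multi-dataset protocol; all concrete
# choices here (classifiers, datasets, CV scheme, SEED) are illustrative.
import numpy as np
from scipy.stats import friedmanchisquare
from sklearn.datasets import load_breast_cancer, load_digits, load_wine
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

SEED = 1234  # declared and reported once, before any result is inspected

classifiers = {
    "NB": GaussianNB(),
    "kNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(random_state=SEED),
}
datasets = {"wine": load_wine, "cancer": load_breast_cancer, "digits": load_digits}

# One mean score per (dataset, classifier) pair, all from the same pre-declared folds.
cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=SEED)
scores = np.array([
    [cross_val_score(clf, *loader(return_X_y=True), cv=cv).mean()
     for clf in classifiers.values()]
    for loader in datasets.values()
])

# Rank-based omnibus test over datasets: one column of scores per classifier.
stat, p = friedmanchisquare(*scores.T)
print(np.round(scores, 3))
print(f"Friedman statistic={stat:.2f}, p={p:.4f}")
```

With only three datasets the chi-square approximation is of course crude; in practice many more datasets are used, and a significant omnibus result is followed by post-hoc pairwise comparisons with a multiple-comparison correction, again with all runs reported.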

Published In

Applied Soft Computing, Volume 104, Issue C, June 2021, 774 pages

Publisher

Elsevier Science Publishers B. V., Netherlands

Author Tags

  1. Statistical tests
  2. Classifier evaluation
  3. Credibility of model evaluation
  4. Experimental protocol

Qualifiers

  • Research-article

