Learner excellence biased by data set selection: A case for data characterisation and artificial data sets

Published: 01 March 2013

Abstract

The excellence of a given learner is usually claimed through a performance comparison with other learners over a collection of data sets. Too often, researchers are unaware of the impact of their data selection on the results: their test beds are small, and the choice of data sets is not supported by any prior data analysis. Conclusions drawn from such test beds cannot be generalised, because particular data characteristics may favour certain learners without the experimenter noticing. This work raises these issues and proposes characterising data sets with complexity measures, which can help both to guide experimental design and to explain the behaviour of learners.
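As an illustration (not taken from the paper itself), one of the simplest data complexity measures of the kind the abstract refers to is the maximum Fisher's discriminant ratio (usually called F1 in Ho and Basu's suite), which scores how well any single feature separates two classes. A minimal sketch, with toy data sets invented for the example:

```python
import numpy as np

def fisher_discriminant_ratio(X, y):
    """Maximum Fisher's discriminant ratio (F1) over all features of a
    two-class data set: max_f (mu1 - mu2)^2 / (var1 + var2).
    Higher values mean at least one feature separates the classes well."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    c0, c1 = np.unique(y)                    # assumes exactly two classes
    a, b = X[y == c0], X[y == c1]
    num = (a.mean(axis=0) - b.mean(axis=0)) ** 2
    den = a.var(axis=0) + b.var(axis=0)
    return float(np.max(num / den))

# Two toy problems of identical size but very different difficulty:
rng = np.random.default_rng(0)
easy_X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
hard_X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(0.5, 1, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)

print(fisher_discriminant_ratio(easy_X, labels))  # large: classes well separated
print(fisher_discriminant_ratio(hard_X, labels))  # small: classes overlap
```

A benchmark composed only of data sets with high F1 would make almost any learner look strong; measures like this let an experimenter check whether a test bed spans a range of difficulties rather than clustering in an easy region.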



Publisher: Elsevier Science Inc., United States


Author Tags

1. Data complexity
2. Learner assessment
3. Supervised learning


Cited By

• (2023) Classifier selection using geometry preserving feature. Neural Computing and Applications 35(28), 20955-20976. doi:10.1007/s00521-023-08828-y
• (2021) wCM based hybrid pre-processing algorithm for class imbalanced dataset. Journal of Intelligent & Fuzzy Systems 41(2), 3339-3354. doi:10.3233/JIFS-210624
• (2021) Classifying multiclass imbalanced data using generalized class-specific extreme learning machine. Progress in Artificial Intelligence 10(3), 259-281. doi:10.1007/s13748-021-00236-4
• (2020) Weighted k-nearest neighbor based data complexity metrics for imbalanced datasets. Statistical Analysis and Data Mining 13(4), 394-404. doi:10.1002/sam.11463
• (2019) How Complex Is Your Classification Problem? ACM Computing Surveys 52(5), 1-34. doi:10.1145/3347711
• (2018) A framework for dynamic classifier selection oriented by the classification problem difficulty. Pattern Recognition 76, 175-190. doi:10.1016/j.patcog.2017.10.038
• (2018) Automatic Classifier Selection Based on Classification Complexity. Pattern Recognition and Computer Vision, 292-303. doi:10.1007/978-3-030-03338-5_25
• (2017) Complexity vs. performance. Proceedings of the 2017 Internet Measurement Conference, 384-397. doi:10.1145/3131365.3131372
• (2017) Centralized vs. distributed feature selection methods based on data complexity measures. Knowledge-Based Systems 117, 27-45. doi:10.1016/j.knosys.2016.09.022
• (2017) Can classification performance be predicted by complexity measures? A study using microarray data. Knowledge and Information Systems 51(3), 1067-1090. doi:10.1007/s10115-016-1003-3
