Learner excellence biased by data set selection: A case for data characterisation and artificial data sets

Published: 01 March 2013

Abstract

The excellence of a given learner is usually claimed through a performance comparison with other learners over a collection of data sets. Too often, researchers are unaware of the impact of their data selection on the results: their test beds are small, and the choice of data sets is not supported by any prior data analysis. Conclusions drawn from such test beds cannot be generalised, because particular data characteristics may favour certain learners without the experimenter noticing. This work raises these issues and proposes characterising data sets with complexity measures, which can help both to guide experimental design and to explain the behaviour of learners.
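As an illustration (not taken from the paper itself), one of the simplest data complexity measures of the kind the abstract refers to is the maximum Fisher's discriminant ratio (usually called F1 in Ho and Basu's suite), which scores how well any single feature separates two classes. A minimal sketch, with toy data sets invented for the example:

```python
import numpy as np

def fisher_discriminant_ratio(X, y):
    """Maximum Fisher's discriminant ratio (F1) over all features of a
    two-class data set: max_f (mu1 - mu2)^2 / (var1 + var2).
    Higher values mean at least one feature separates the classes well."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    c0, c1 = np.unique(y)                    # assumes exactly two classes
    a, b = X[y == c0], X[y == c1]
    num = (a.mean(axis=0) - b.mean(axis=0)) ** 2
    den = a.var(axis=0) + b.var(axis=0)
    return float(np.max(num / den))

# Two toy problems of identical size but very different difficulty:
rng = np.random.default_rng(0)
easy_X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
hard_X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(0.5, 1, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)

print(fisher_discriminant_ratio(easy_X, labels))  # large: classes well separated
print(fisher_discriminant_ratio(hard_X, labels))  # small: classes overlap
```

A benchmark composed only of data sets with high F1 would make almost any learner look strong; measures like this let an experimenter check whether a test bed spans a range of difficulties rather than clustering in an easy region.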



Publisher: Elsevier Science Inc., United States


Author Tags

1. Data complexity
2. Learner assessment
3. Supervised learning


Cited By

• (2023) Classifier selection using geometry preserving feature. Neural Computing and Applications 35(28), 20955-20976. doi:10.1007/s00521-023-08828-y
• (2021) wCM based hybrid pre-processing algorithm for class imbalanced dataset. Journal of Intelligent & Fuzzy Systems 41(2), 3339-3354. doi:10.3233/JIFS-210624
• (2021) Classifying multiclass imbalanced data using generalized class-specific extreme learning machine. Progress in Artificial Intelligence 10(3), 259-281. doi:10.1007/s13748-021-00236-4
• (2020) Weighted k-nearest neighbor based data complexity metrics for imbalanced datasets. Statistical Analysis and Data Mining 13(4), 394-404. doi:10.1002/sam.11463
• (2019) How Complex Is Your Classification Problem? ACM Computing Surveys 52(5), 1-34. doi:10.1145/3347711
• (2018) A framework for dynamic classifier selection oriented by the classification problem difficulty. Pattern Recognition 76, 175-190. doi:10.1016/j.patcog.2017.10.038
• (2018) Automatic Classifier Selection Based on Classification Complexity. Pattern Recognition and Computer Vision, 292-303. doi:10.1007/978-3-030-03338-5_25
• (2017) Complexity vs. performance. Proceedings of the 2017 Internet Measurement Conference, 384-397. doi:10.1145/3131365.3131372
• (2017) Centralized vs. distributed feature selection methods based on data complexity measures. Knowledge-Based Systems 117, 27-45. doi:10.1016/j.knosys.2016.09.022
• (2017) Can classification performance be predicted by complexity measures? A study using microarray data. Knowledge and Information Systems 51(3), 1067-1090. doi:10.1007/s10115-016-1003-3
