Towards UCI+: A mindful repository design

Published: 01 March 2014

Abstract

Public repositories have contributed to the maturation of experimental methodology in machine learning. Publicly available data sets have allowed researchers to empirically assess their learners and, jointly with open source machine learning software, they have favoured the emergence of comparative analyses of learners' performance over a common framework. These studies have brought standard procedures for evaluating machine learning techniques. However, current claims, such as the superiority of enhanced algorithms, are biased by unsupported assumptions made throughout some practices. In this paper, the early steps of the methodology, which concern data set selection, are inspected. In particular, the exploitation of the most popular data repository in machine learning, the UCI repository, is examined. We analyse the type, complexity, and use of UCI data sets. The study recommends the design of a mindful data repository, UCI+, which should include a set of properly characterised data sets forming a complete and representative sample of real-world problems, enriched with artificial benchmarks. The ultimate goal of UCI+ is to lay the foundations of a well-supported methodology for learner assessment.
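The complexity characterisation mentioned in the abstract builds on geometrical descriptors of class separability, such as those catalogued by Ho and Basu. As an illustration only (the helper `fisher_ratio_f1` below is hypothetical, not taken from the paper), one of the simplest such descriptors, the maximum Fisher's discriminant ratio (F1), can be sketched for a two-class problem:

```python
import numpy as np

def fisher_ratio_f1(X, y):
    """Maximum Fisher's discriminant ratio (F1) over all features for a
    two-class problem: f = (mu1 - mu2)^2 / (var1 + var2), maximised
    across features. Higher values mean easier linear separability."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    assert classes.size == 2, "this sketch assumes exactly two classes"
    a, b = X[y == classes[0]], X[y == classes[1]]
    num = (a.mean(axis=0) - b.mean(axis=0)) ** 2
    den = a.var(axis=0) + b.var(axis=0)
    return float(np.max(num / den))

# Two well-separated Gaussian blobs: F1 should be large.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
print(fisher_ratio_f1(X, y))
```

Low F1 values flag heavily overlapping classes, i.e. geometrically harder problems; a characterised repository such as the proposed UCI+ would report descriptors of this kind alongside each data set.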

References

[1]
Metal: A meta-learning assistant for providing user support in machine learning and data mining, 1998.
[2]
Alcalá-Fdez, J., Sánchez, L., García, S., del Jesus, M.J., Ventura, S., Garrell, J.M., Otero, J., Romero, C., Bacardit, J., Rivas, V., Fernández, J. and Herrera, F., KEEL: A software tool to assess evolutionary algorithms for data mining problems. Soft Computing. v13. 307-318.
[3]
Bernadó-Mansilla, E. and Ho, T.K., Domain of competence of XCS classifier system in complexity measurement space. IEEE Transactions on Evolutionary Computation. v9. 82-104.
[4]
P. Brazdil, J. Gama, B. Henery, Characterizing the applicability of classification algorithms using meta-level learning, in: Proceedings of the European Conference on Machine Learning, 1994, pp. 83-102.
[5]
Castiello, C., Castellano, G. and Fanelli, A.M., MINDFUL: A framework for meta-inductive neuro-fuzzy learning. Information Sciences. v178. 3253-3274.
[6]
Deb, K.D., Pratap, A., Agarwal, S. and Meyarivan, T., A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation. v6. 182-197.
[7]
Demšar, J., Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research. v7. 1-30.
[8]
Dietterich, T.G., Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation. v10. 1895-1924.
[9]
Duin, R.P.W., Loog, M., Pekalska, E. and Tax, D.M.J., Feature-based dissimilarity space classification. In: Lecture Notes in Computer Science, vol. 6388. Springer. pp. 46-55.
[10]
Fisher, R.A., The use of multiple measurements in taxonomic problems. Annals of Eugenics. v7. 179-188.
[11]
A. Frank, A. Asuncion, UCI machine learning repository, 2010.
[12]
García, S. and Herrera, F., An extension on "statistical comparisons of classifiers over multiple data sets" for all pairwise comparisons. Journal of Machine Learning Research. v9. 2677-2694.
[13]
van der Heijden, F., Duin, R.P.W., de Ridder, D. and Tax, D.M.J., Classification, Parameter Estimation and State Estimation: An Engineering Approach Using MATLAB. 2004. John Wiley & Sons.
[14]
Ho, T.K. and Basu, M., Complexity measures of supervised classification problems. IEEE Transactions on Pattern Analysis and Machine Intelligence. v24. 289-300.
[15]
Holte, R.C., Very simple classification rules perform well on most commonly used datasets. Machine Learning. v11. 63-90.
[16]
Jankowski, N. and Grąbczewski, K., Universal meta-learning architecture and algorithms. In: Studies in Computational Intelligence, vol. 358. Springer, Berlin Heidelberg. pp. 1-76.
[17]
Kohavi, R., Sommerfield, D. and Dougherty, J., Data mining using MLC++, a machine learning library in C++. International Journal on Artificial Intelligence Tools. v6. 537-566.
[18]
Kolmogorov, A.N., Three approaches to the quantitative definition of information. Problems in Information Transmission. v1. 1-7.
[19]
Krasnogor, N. and Pelta, D.A., Measuring the similarity of protein structures by means of the universal similarity metric. Bioinformatics. v20. 1015-1021.
[20]
Langley, P., The changing science of machine learning. Machine Learning. v82. 275-279.
[21]
Lim, T.S., Loh, W.Y. and Shih, Y.S., A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning. v40. 203-229.
[22]
Luengo, J., Fernández, A., García, S. and Herrera, F., Addressing data complexity for imbalanced data sets: analysis of SMOTE-based oversampling and evolutionary undersampling. Soft Computing.
[23]
N. Macià, Data complexity in supervised learning: a far-reaching implication, Ph.D. thesis, La Salle - Universitat Ramon Llull, 2011.
[24]
Macià, N., Bernadó-Mansilla, E. and Orriols-Puig, A., On the dimensions of data complexity through synthetic data sets. In: Frontiers in Artificial Intelligence and Applications, vol. 184. IOS Press. pp. 244-252.
[25]
Macià, N., Bernadó-Mansilla, E., Orriols-Puig, A. and Ho, T.K., Learner excellence biased by data set selection: a case for data characterisation and artificial data sets. Pattern Recognition. v46. 1054-1066.
[26]
Macià, N., Orriols-Puig, A. and Bernadó-Mansilla, E., In search of targeted-complexity problems. In: Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, ACM. pp. 1055-1062.
[27]
Maciejowski, J.M., Model discrimination using an algorithmic information criterion. Automatica. v15. 579-593.
[28]
Michie, D., Spiegelhalter, D.J., Taylor, C., Campbell, J. (Eds.), Machine Learning, Neural and Statistical Classification. Ellis Horwood, Upper Saddle River, NJ, USA.
[29]
A. Orriols-Puig, N. Macià, T.K. Ho, Documentation for the data complexity library in C++, Technical Report, La Salle - Universitat Ramon Llull, 2010.
[30]
Y. Peng, P.A. Flach, C. Soares, P. Brazdil, Improved dataset characterisation for meta-learning, in: Proceedings of the 5th International Conference on Discovery Science, 2002, pp. 141-152.
[31]
L. Prechelt, PROBEN1 - A set of benchmarks and benchmarking rules for neural network training algorithms, Technical Report, Universität Karlsruhe, Fakultät für Informatik, 1994.
[32]
Saitta, L. and Neri, F., Learning in the "real world". Machine Learning. v30. 133-163.
[33]
Salzberg, S.L., On comparing classifiers: pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery. v1. 317-328.
[34]
Sánchez, J.S., Mollineda, R.A. and Sotoca, J.M., An analysis of how training data complexity affects the nearest neighbor classifiers. Pattern Analysis and Applications. v10. 189-201.
[35]
Solomonoff, R.J., The Kolmogorov lecture: the universal distribution and machine learning. The Computer Journal. v46. 598-601.
[36]
R. Vilalta, C.G. Giraud-Carrier, P. Brazdil, Meta-learning - concepts and techniques, in: Data Mining and Knowledge Discovery Handbook, 2010, pp. 717-731.
[37]
Witten, I.H. and Frank, E., Data Mining: Practical Machine Learning Tools and Techniques. 2005. second ed. Morgan Kaufman, San Francisco.
[38]
Wolpert, D.H., The lack of a priori distinctions between learning algorithms. Neural Computation. v8. 1341-1390.


Reviews

José Hernández-Orallo

The evaluation of machine learning algorithms has always been controversial. Which datasets, experimental settings, and statistical tests must be chosen? Dataset repositories such as UCI have been enormously handy for machine learning research. However, the selection of datasets is commonly careless (when not cherry-picked). This paper is not the first criticism of the way these datasets are used, but it is the most comprehensive, insightful, and constructive so far. The use of complexity measures and a characterization of the UCI datasets are most welcome. Nonetheless, the analysis could have been more complete with complexity measures derived from (approximations of) Kolmogorov complexity and with other performance metrics (only accuracy is used). Despite the implausibility of the assumptions of the no-free-lunch theorem, it pervades the authors' notion of diversity. More diversity does not mean that problems should cover all ranges of error and complexity measures in a uniform way. More effort should be made to clarify what a "representative" sample of real problems is if we want to assess whether the UCI repository is diverse and "challenging" enough. The authors also present a basic dataset generator based on injecting distortion (a pattern-based generator would possibly be a better option). In any case, the use of dataset generators jointly with a more regulated and automated evaluation procedure is the way to go. This paper should not only contribute to a debate in the community, but should also become a must-read for everyone using the UCI datasets to evaluate machine learning algorithms. Online Computing Reviews Service
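The distortion-based generation the reviewer refers to can be made concrete with a toy sketch. The helper `inject_label_noise` below is hypothetical and far simpler than the authors' generator; it only illustrates the general idea of perturbing an existing data set at a controlled rate:

```python
import random

def inject_label_noise(labels, rate, seed=None):
    """Return a copy of `labels` in which a fraction `rate` of entries
    is flipped to a different class. A toy form of 'distortion'
    injection, not the generator proposed in the paper."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    noisy = list(labels)
    n_flip = int(round(rate * len(labels)))
    for i in rng.sample(range(len(labels)), n_flip):
        noisy[i] = rng.choice([c for c in classes if c != noisy[i]])
    return noisy

y = [0] * 50 + [1] * 50
y_noisy = inject_label_noise(y, rate=0.2, seed=1)
print(sum(a != b for a, b in zip(y, y_noisy)))  # prints 20
```

Sweeping `rate` yields a family of progressively harder variants of one benchmark, which is the kind of controlled difficulty a repository like UCI+ could offer alongside real-world data sets.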

Publisher

Elsevier Science Inc.

United States

Author Tags

  1. Classification
  2. Data complexity
  3. Data repository
  4. Synthetic data set

Qualifiers

  • Article

Cited By

  • (2024) Optimal selection of benchmarking datasets for unbiased machine learning algorithm evaluation. Data Mining and Knowledge Discovery 38:2, 461-500. doi:10.1007/s10618-023-00957-1
  • (2023) Generating multidimensional clusters with support lines. Knowledge-Based Systems 277:C. doi:10.1016/j.knosys.2023.110836
  • (2022) Empirical Analysis of Machine Learning Algorithms for Multiclass Prediction. Wireless Communications & Mobile Computing 2022. doi:10.1155/2022/7451152
  • (2022) Instance Space Analysis for Algorithm Testing: Methodology and Software Tools. ACM Computing Surveys 55:12, 1-31. doi:10.1145/3572895
  • (2022) On the joint-effect of class imbalance and overlap: a critical review. Artificial Intelligence Review 55:8, 6207-6275. doi:10.1007/s10462-022-10150-3
  • (2020) A Many-Objective Optimization Approach for Complexity-based Data Set Generation. 2020 IEEE Congress on Evolutionary Computation (CEC), 1-8. doi:10.1109/CEC48606.2020.9185543
  • (2020) Measuring Instance Hardness Using Data Complexity Measures. Intelligent Systems, 483-497. doi:10.1007/978-3-030-61380-8_33
  • (2019) How Complex Is Your Classification Problem? ACM Computing Surveys 52:5, 1-34. doi:10.1145/3347711
  • (2019) Evolving controllably difficult datasets for clustering. Proceedings of the Genetic and Evolutionary Computation Conference, 463-471. doi:10.1145/3321707.3321761
  • (2018) Instance spaces for machine learning classification. Machine Learning 107:1, 109-147. doi:10.1007/s10994-017-5629-5
