Towards UCI+: A mindful repository design

Published: 01 March 2014

Abstract

Public repositories have contributed to the maturation of experimental methodology in machine learning. Publicly available data sets have allowed researchers to empirically assess their learners and, jointly with open source machine learning software, they have favoured the emergence of comparative analyses of learners' performance over a common framework. These studies have brought standard procedures for evaluating machine learning techniques. However, current claims, such as the superiority of enhanced algorithms, are biased by unsupported assumptions made throughout some practices. In this paper, the early steps of the methodology, which concern data set selection, are inspected. In particular, the exploitation of the most popular data repository in machine learning, the UCI repository, is examined. We analyse the type, complexity, and use of UCI data sets. The study recommends the design of a mindful data repository, UCI+, which should include a set of properly characterised data sets forming a complete and representative sample of real-world problems, enriched with artificial benchmarks. The ultimate goal of UCI+ is to lay the foundations of a well-supported methodology for learner assessment.
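The complexity characterisation mentioned in the abstract builds on geometrical descriptors of class separability, such as those catalogued by Ho and Basu. As an illustration only (the helper `fisher_ratio_f1` below is hypothetical, not taken from the paper), one of the simplest such descriptors, the maximum Fisher's discriminant ratio (F1), can be sketched for a two-class problem:

```python
import numpy as np

def fisher_ratio_f1(X, y):
    """Maximum Fisher's discriminant ratio (F1) over all features for a
    two-class problem: f = (mu1 - mu2)^2 / (var1 + var2), maximised
    across features. Higher values mean easier linear separability."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    assert classes.size == 2, "this sketch assumes exactly two classes"
    a, b = X[y == classes[0]], X[y == classes[1]]
    num = (a.mean(axis=0) - b.mean(axis=0)) ** 2
    den = a.var(axis=0) + b.var(axis=0)
    return float(np.max(num / den))

# Two well-separated Gaussian blobs: F1 should be large.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
print(fisher_ratio_f1(X, y))
```

Low F1 values flag heavily overlapping classes, i.e. geometrically harder problems; a characterised repository such as the proposed UCI+ would report descriptors of this kind alongside each data set.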

References

[1]
Metal: A meta-learning assistant for providing user support in machine learning and data mining, 1998.
[2]
Alcalá-Fdez, J., Sánchez, L., García, S., del Jesus, M.J., Ventura, S., Garrell, J.M., Otero, J., Romero, C., Bacardit, J., Rivas, V., Fernández, J. and Herrera, F., KEEL: A software tool to assess evolutionary algorithms for data mining problems. Soft Computing. v13. 307-318.
[3]
Bernadó-Mansilla, E. and Ho, T.K., Domain of competence of XCS classifier system in complexity measurement space. IEEE Transactions on Evolutionary Computation. v9. 82-104.
[4]
P. Brazdil, J. Gama, B. Henery, Characterizing the applicability of classification algorithms using meta-level learning, in: Proceedings of the European Conference on Machine Learning, 1994, pp. 83-102.
[5]
Castiello, C., Castellano, G. and Fanelli, A.M., MINDFUL: A framework for meta-inductive neuro-fuzzy learning. Information Sciences. v178. 3253-3274.
[6]
Deb, K.D., Pratap, A., Agarwal, S. and Meyarivan, T., A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation. v6. 182-197.
[7]
Demšar, J., Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research. v7. 1-30.
[8]
Dietterich, T.G., Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation. v10. 1895-1924.
[9]
Duin, R.P.W., Loog, M., Pekalska, E. and Tax, D.M.J., Feature-based dissimilarity space classification. In: Lecture Notes in Computer Science, vol. 6388. Springer. pp. 46-55.
[10]
Fisher, R.A., The use of multiple measurements in taxonomic problems. Annals of Eugenics. v7. 179-188.
[11]
A. Frank, A. Asuncion, UCI machine learning repository, 2010.
[12]
García, S. and Herrera, F., An extension on "statistical comparisons of classifiers over multiple data sets" for all pairwise comparisons. Journal of Machine Learning Research. v9. 2677-2694.
[13]
van der Heijden, F., Duin, R.P.W., de Ridder, D. and Tax, D.M.J., Classification, Parameter Estimation and State Estimation: An Engineering Approach Using MATLAB. 2004. John Wiley & Sons.
[14]
Ho, T.K. and Basu, M., Complexity measures of supervised classification problems. IEEE Transactions on Pattern Analysis and Machine Intelligence. v24. 289-300.
[15]
Holte, R.C., Very simple classification rules perform well on most commonly used datasets. Machine Learning. v11. 63-90.
[16]
Jankowski, N. and Grąbczewski, K., Universal meta-learning architecture and algorithms. In: Studies in Computational Intelligence, vol. 358. Springer, Berlin Heidelberg. pp. 1-76.
[17]
Kohavi, R., Sommerfield, D. and Dougherty, J., Data mining using MLC++, a machine learning library in C++. International Journal on Artificial Intelligence Tools. v6. 537-566.
[18]
Kolmogorov, A.N., Three approaches to the quantitative definition of information. Problems in Information Transmission. v1. 1-7.
[19]
Krasnogor, N. and Pelta, D.A., Measuring the similarity of protein structures by means of the universal similarity metric. Bioinformatics. v20. 1015-1021.
[20]
Langley, P., The changing science of machine learning. Machine Learning. v82. 275-279.
[21]
Lim, T.S., Loh, W.Y. and Shih, Y.S., A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning. v40. 203-229.
[22]
Luengo, J., Fernández, A., García, S. and Herrera, F., Addressing data complexity for imbalanced data sets: analysis of SMOTE-based oversampling and evolutionary undersampling. Soft Computing.
[23]
N. Macià, Data complexity in supervised learning: a far-reaching implication, Ph.D. thesis, La Salle - Universitat Ramon Llull, 2011.
[24]
Macià, N., Bernadó-Mansilla, E. and Orriols-Puig, A., On the dimensions of data complexity through synthetic data sets. In: Frontiers in Artificial Intelligence and Applications, vol. 184. IOS Press. pp. 244-252.
[25]
Macià, N., Bernadó-Mansilla, E., Orriols-Puig, A. and Ho, T.K., Learner excellence biased by data set selection: a case for data characterisation and artificial data sets. Pattern Recognition. v46. 1054-1066.
[26]
Macià, N., Orriols-Puig, A. and Bernadó-Mansilla, E., In search of targeted-complexity problems. In: Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, ACM. pp. 1055-1062.
[27]
Maciejowski, J.M., Model discrimination using an algorithmic information criterion. Automatica. v15. 579-593.
[28]
Michie, D., Spiegelhalter, D.J., Taylor, C., Campbell, J. (Eds.), Machine Learning, Neural and Statistical Classification. Ellis Horwood, Upper Saddle River, NJ, USA.
[29]
A. Orriols-Puig, N. Macià, T.K. Ho, Documentation for the data complexity library in C++, Technical Report, La Salle - Universitat Ramon Llull, 2010.
[30]
Y. Peng, P.A. Flach, C. Soares, P. Brazdil, Improved dataset characterisation for meta-learning, in: Proceedings of the 5th International Conference on Discovery Science, 2002, pp. 141-152.
[31]
L. Prechelt, PROBEN1 - A set of benchmarks and benchmarking rules for neural network training algorithms, Technical Report, Universität Karlsruhe, Fakultät für Informatik, 1994.
[32]
Saitta, L. and Neri, F., Learning in the "real world". Machine Learning. v30. 133-163.
[33]
Salzberg, S.L., On comparing classifiers: pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery. v1. 317-328.
[34]
Sánchez, J.S., Mollineda, R.A. and Sotoca, J.M., An analysis of how training data complexity affects the nearest neighbor classifiers. Pattern Analysis and Applications. v10. 189-201.
[35]
Solomonoff, R.J., The Kolmogorov lecture: the universal distribution and machine learning. The Computer Journal. v46. 598-601.
[36]
R. Vilalta, C.G. Giraud-Carrier, P. Brazdil, Meta-learning - concepts and techniques, in: Data Mining and Knowledge Discovery Handbook, 2010, pp. 717-731.
[37]
Witten, I.H. and Frank, E., Data Mining: Practical Machine Learning Tools and Techniques. 2005. second ed. Morgan Kaufman, San Francisco.
[38]
Wolpert, D.H., The lack of a priori distinctions between learning algorithms. Neural Computation. v8. 1341-1390.


Reviews

José Hernández-Orallo

The evaluation of machine learning algorithms has always been controversial. Which datasets, experimental settings, and statistical tests must be chosen? Dataset repositories such as UCI have been enormously handy for machine learning research. However, the selection of datasets is commonly careless (when not cherry-picked). This paper is not the first criticism of the way these datasets are used, but it is the most comprehensive, insightful, and constructive so far. The use of complexity measures and a characterization of the UCI datasets are most welcome. Nonetheless, the analysis could have been more complete with complexity measures derived from (approximations of) Kolmogorov complexity and with other performance metrics (only accuracy is used). Despite the implausibility of the assumptions of the no-free-lunch theorem, it pervades the authors' notion of diversity. More diversity does not mean that problems should cover all ranges of error and complexity measures in a uniform way. More effort should be made to clarify what a "representative" sample of real problems is if we want to assess whether the UCI repository is diverse and "challenging" enough. The authors also present a basic dataset generator based on injecting distortion (a pattern-based generator would possibly be a better option). In any case, the use of dataset generators jointly with a more regulated and automated evaluation procedure is the way to go. This paper should not only contribute to a debate in the community, but should also become a must-read for everyone using the UCI datasets to evaluate machine learning algorithms. Online Computing Reviews Service
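The distortion-based generation the reviewer refers to can be made concrete with a toy sketch. The helper `inject_label_noise` below is hypothetical and far simpler than the authors' generator; it only illustrates the general idea of perturbing an existing data set at a controlled rate:

```python
import random

def inject_label_noise(labels, rate, seed=None):
    """Return a copy of `labels` in which a fraction `rate` of entries
    is flipped to a different class. A toy form of 'distortion'
    injection, not the generator proposed in the paper."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    noisy = list(labels)
    n_flip = int(round(rate * len(labels)))
    for i in rng.sample(range(len(labels)), n_flip):
        noisy[i] = rng.choice([c for c in classes if c != noisy[i]])
    return noisy

y = [0] * 50 + [1] * 50
y_noisy = inject_label_noise(y, rate=0.2, seed=1)
print(sum(a != b for a, b in zip(y, y_noisy)))  # prints 20
```

Sweeping `rate` yields a family of progressively harder variants of one benchmark, which is the kind of controlled difficulty a repository like UCI+ could offer alongside real-world data sets.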

Publisher

Elsevier Science Inc.

United States

Author Tags

  1. Classification
  2. Data complexity
  3. Data repository
  4. Synthetic data set

Qualifiers

  • Article

Cited By

  • (2024) Optimal selection of benchmarking datasets for unbiased machine learning algorithm evaluation. Data Mining and Knowledge Discovery 38:2, 461-500. doi:10.1007/s10618-023-00957-1
  • (2023) Generating multidimensional clusters with support lines. Knowledge-Based Systems 277:C. doi:10.1016/j.knosys.2023.110836
  • (2022) Empirical Analysis of Machine Learning Algorithms for Multiclass Prediction. Wireless Communications & Mobile Computing 2022. doi:10.1155/2022/7451152
  • (2022) Instance Space Analysis for Algorithm Testing: Methodology and Software Tools. ACM Computing Surveys 55:12, 1-31. doi:10.1145/3572895
  • (2022) On the joint-effect of class imbalance and overlap: a critical review. Artificial Intelligence Review 55:8, 6207-6275. doi:10.1007/s10462-022-10150-3
  • (2020) A Many-Objective Optimization Approach for Complexity-based Data Set Generation. 2020 IEEE Congress on Evolutionary Computation (CEC), 1-8. doi:10.1109/CEC48606.2020.9185543
  • (2020) Measuring Instance Hardness Using Data Complexity Measures. Intelligent Systems, 483-497. doi:10.1007/978-3-030-61380-8_33
  • (2019) How Complex Is Your Classification Problem? ACM Computing Surveys 52:5, 1-34. doi:10.1145/3347711
  • (2019) Evolving controllably difficult datasets for clustering. Proceedings of the Genetic and Evolutionary Computation Conference, 463-471. doi:10.1145/3321707.3321761
  • (2018) Instance spaces for machine learning classification. Machine Learning 107:1, 109-147. doi:10.1007/s10994-017-5629-5
