DOI: 10.1145/3430984.3430997

Is it hard to learn a classifier on this dataset?

Published: 02 January 2021

Abstract

Identifying how hard it is to achieve good classification performance on a given dataset can be useful in data analysis, model selection, and meta-learning. We hypothesize that the clustering indices of a dataset, which capture its structural characteristics, are related to its classification complexity. In this work, we propose a method for determining the empirical classification complexity of a dataset from its clustering indices. We model this mapping as a supervised classification task in which the estimated clustering indices of a dataset form the features, and an indicator variable representing its classification complexity serves as the label. For the experiments, we use a set of clustering and classification algorithms spanning different modeling assumptions. To test whether a given dataset is complex, we estimate its clustering indices and feed them to the trained complexity classifier to obtain a prediction. Our approach is simple, yet effective and robust across many datasets and classifiers. We evaluate our method using 60 publicly available datasets.
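The pipeline described in the abstract can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the specific indices (silhouette, Davies-Bouldin, Calinski-Harabasz via scikit-learn), the k-means clustering step, the random-forest meta-classifier, and the synthetic "easy" vs. "hard" datasets standing in for the paper's 60 real datasets are all assumptions made for the sake of the example.

```python
# Sketch of the meta-learning idea: clustering indices of each dataset
# become meta-features, and a binary easy/complex label trains a
# "complexity classifier" that judges new datasets from indices alone.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

def clustering_indices(X, n_clusters=2, seed=0):
    """Estimate a few clustering indices for a dataset (the meta-features)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X)
    return np.array([
        silhouette_score(X, labels),
        davies_bouldin_score(X, labels),
        calinski_harabasz_score(X, labels),
    ])

# Build a small meta-dataset: well-separated blobs play the role of
# "easy" datasets (label 0), overlapping noisy classes play "complex" (1).
meta_X, meta_y = [], []
for i in range(20):
    X_easy, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.5,
                           random_state=i)
    X_hard, _ = make_classification(n_samples=200, n_features=2,
                                    n_informative=2, n_redundant=0,
                                    class_sep=0.3, flip_y=0.2,
                                    random_state=i)
    meta_X += [clustering_indices(X_easy), clustering_indices(X_hard)]
    meta_y += [0, 1]

# The complexity classifier maps clustering indices -> complexity label.
complexity_clf = RandomForestClassifier(random_state=0).fit(meta_X, meta_y)

# Judge a new, unseen dataset from its clustering indices alone.
X_new, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.5,
                      random_state=99)
print(complexity_clf.predict([clustering_indices(X_new)]))
```

Note that no classifier is ever trained on the new dataset itself; only its clustering indices are computed, which is the efficiency argument the abstract makes.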


Cited By

  • (2023) CIAMS: clustering indices-based automatic classification model selection. International Journal of Data Science and Analytics. https://doi.org/10.1007/s41060-023-00441-5 (online: 19 Aug 2023)

Published In

CODS-COMAD '21: Proceedings of the 3rd ACM India Joint International Conference on Data Science & Management of Data (8th ACM IKDD CODS & 26th COMAD)
January 2021
453 pages
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. Automated Machine Learning
  2. Classification Complexity
  3. Cluster Analysis
  4. Meta-learning
  5. Model Selection

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Claritrics Inc

Conference

CODS COMAD 2021: 8th ACM IKDD CODS and 26th COMAD
January 2 - 4, 2021
Bangalore, India

Acceptance Rates

Overall Acceptance Rate 197 of 680 submissions, 29%
