DOI: 10.1145/3430984.3430997

Is it hard to learn a classifier on this dataset?

Published: 02 January 2021

Abstract

Identifying how hard it is to achieve good classification performance on a given dataset can be useful in data analysis, model selection, and meta-learning. We hypothesize that the clustering indices of a dataset, which capture its structural characteristics, are related to its classification complexity. In this work, we propose a method for determining the empirical classification complexity of a dataset from its clustering indices. We model this mapping as a supervised classification task in which the estimated clustering indices of a dataset form the features, and an indicator variable representing its classification complexity serves as the label. For the experiments, we use a set of clustering and classification algorithms spanning different modeling assumptions. To test whether a given dataset is complex, we estimate its clustering indices and feed them to the trained complexity classifier to obtain a prediction. Our approach is simple, yet effective and robust across many datasets and classifiers. We evaluate our method using 60 publicly available datasets.
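The pipeline described in the abstract can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the specific indices (silhouette, Davies-Bouldin, Calinski-Harabasz via scikit-learn), the k-means clustering step, the random-forest meta-classifier, and the synthetic "easy" vs. "hard" datasets standing in for the paper's 60 real datasets are all assumptions made for the sake of the example.

```python
# Sketch of the meta-learning idea: clustering indices of each dataset
# become meta-features, and a binary easy/complex label trains a
# "complexity classifier" that judges new datasets from indices alone.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

def clustering_indices(X, n_clusters=2, seed=0):
    """Estimate a few clustering indices for a dataset (the meta-features)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X)
    return np.array([
        silhouette_score(X, labels),
        davies_bouldin_score(X, labels),
        calinski_harabasz_score(X, labels),
    ])

# Build a small meta-dataset: well-separated blobs play the role of
# "easy" datasets (label 0), overlapping noisy classes play "complex" (1).
meta_X, meta_y = [], []
for i in range(20):
    X_easy, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.5,
                           random_state=i)
    X_hard, _ = make_classification(n_samples=200, n_features=2,
                                    n_informative=2, n_redundant=0,
                                    class_sep=0.3, flip_y=0.2,
                                    random_state=i)
    meta_X += [clustering_indices(X_easy), clustering_indices(X_hard)]
    meta_y += [0, 1]

# The complexity classifier maps clustering indices -> complexity label.
complexity_clf = RandomForestClassifier(random_state=0).fit(meta_X, meta_y)

# Judge a new, unseen dataset from its clustering indices alone.
X_new, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.5,
                      random_state=99)
print(complexity_clf.predict([clustering_indices(X_new)]))
```

Note that no classifier is ever trained on the new dataset itself; only its clustering indices are computed, which is the efficiency argument the abstract makes.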


Cited By

  • (2023) CIAMS: clustering indices-based automatic classification model selection. International Journal of Data Science and Analytics. https://doi.org/10.1007/s41060-023-00441-5 (online: 19 Aug 2023)

Published In

CODS-COMAD '21: Proceedings of the 3rd ACM India Joint International Conference on Data Science & Management of Data (8th ACM IKDD CODS & 26th COMAD)
January 2021
453 pages
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. Automated Machine Learning
  2. Classification Complexity
  3. Cluster Analysis
  4. Meta-learning
  5. Model Selection

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Claritrics Inc

Conference

CODS COMAD 2021: 8th ACM IKDD CODS and 26th COMAD
January 2 - 4, 2021
Bangalore, India

Acceptance Rates

Overall Acceptance Rate 197 of 680 submissions, 29%
