
Automatic selection of clustering algorithms using supervised graph embedding

Published: 01 October 2021

Abstract

The widespread adoption of machine learning (ML) techniques and the extensive expertise required to apply them have led to increased interest in automated ML solutions that reduce the need for human intervention. One of the main challenges in applying ML to previously unseen problems is algorithm selection – the identification of high-performing algorithm(s) for a given dataset, task, and evaluation measure. This study addresses the algorithm selection challenge for data clustering, a fundamental task in data mining that is aimed at grouping similar objects. We present MARCO-GE, a novel meta-learning approach for the automated recommendation of clustering algorithms. MARCO-GE first transforms datasets into graphs and then utilizes a graph convolutional neural network technique to extract their latent representation. Using the embedding representations obtained, MARCO-GE trains a ranking meta-model capable of accurately recommending top-performing algorithms for a new dataset and clustering evaluation measure. An extensive evaluation on 210 datasets, 17 clustering algorithms, and 10 clustering measures demonstrates the effectiveness of our approach and its superiority in terms of predictive and generalization performance over state-of-the-art clustering meta-learning approaches.
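The abstract outlines a three-stage pipeline: each dataset is converted into a graph, a graph convolutional network produces a dataset-level embedding, and a ranking meta-model maps that embedding to a recommended ordering of clustering algorithms for a chosen evaluation measure. The Python sketch below illustrates only the shape of such a pipeline; the PCA-plus-k-NN graph construction, the single random-weight propagation step standing in for the trained GCN, the mean pooling, and the random-forest regressor used as a pointwise ranking meta-model are simplifying assumptions for illustration, not the MARCO-GE implementation.

```python
# Minimal sketch of a MARCO-GE-style recommendation pipeline (illustrative only).
# Assumptions not taken from the paper: PCA + k-NN graph construction, a single
# random-weight GCN-style propagation instead of a trained graph convolutional
# network, mean pooling, and a random-forest regressor as the ranking meta-model.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import kneighbors_graph


def dataset_to_embedding(X, k=10, dim=16, seed=0):
    """Map a tabular dataset to a fixed-size vector through a graph representation."""
    rng = np.random.default_rng(seed)
    Z = PCA(n_components=min(dim, X.shape[1])).fit_transform(X)
    A = kneighbors_graph(Z, n_neighbors=k, mode="connectivity").toarray()
    A = np.maximum(A, A.T) + np.eye(len(Z))          # symmetrise and add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    A_hat = d_inv_sqrt @ A @ d_inv_sqrt              # GCN-style normalised adjacency
    W = rng.normal(size=(Z.shape[1], dim))           # random projection (stand-in for learned GCN weights)
    H = np.tanh(A_hat @ Z @ W)                       # one propagation step -> node embeddings
    return H.mean(axis=0)                            # pool nodes into a dataset-level embedding


# Meta-training data: embeddings of previously seen datasets paired with the measured
# performance of each candidate clustering algorithm (scores here are placeholders).
rng = np.random.default_rng(0)
algorithms = ["kmeans", "spectral", "dbscan", "agglomerative"]
meta_X = np.stack([dataset_to_embedding(rng.normal(size=(200, 8)), seed=i) for i in range(30)])
meta_scores = rng.uniform(size=(30, len(algorithms)))

meta_model = RandomForestRegressor(n_estimators=200, random_state=0)
meta_model.fit(meta_X, meta_scores)

# Recommendation for a new, unseen dataset: predict per-algorithm scores and rank them.
new_embedding = dataset_to_embedding(rng.normal(size=(150, 8)), seed=99).reshape(1, -1)
predicted = meta_model.predict(new_embedding)[0]
print("Recommended order:", [algorithms[i] for i in np.argsort(predicted)[::-1]])
```

Pooling node embeddings into a single fixed-size vector is what allows datasets of different sizes to be handled by one meta-model; in the paper this role is played by a learned graph embedding rather than the random projection used in this sketch.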

Cited By

  • A Survey on AutoML Methods and Systems for Clustering. ACM Transactions on Knowledge Discovery from Data 18(5), 1–30 (2024). https://doi.org/10.1145/3643564
  • The Use of Dynamic n-Gram to Enhance TF-IDF Features Extraction for Bahasa Indonesia Cyberbullying Classification. Proceedings of the 2023 12th International Conference on Software and Computer Applications, 200–205 (2023). https://doi.org/10.1145/3587828.3587858
  • Graph-based fine-grained model selection for multi-source domain. Pattern Analysis & Applications 26(3), 1481–1492 (2023). https://doi.org/10.1007/s10044-023-01176-6
  • Learning dataset representation for automatic machine learning algorithm selection. Knowledge and Information Systems 64(10), 2599–2635 (2022). https://doi.org/10.1007/s10115-022-01716-2


Published In

           Information Sciences: an International Journal, Volume 577, Issue C
           October 2021, 902 pages

           Publisher

           Elsevier Science Inc., United States

          Author Tags

          1. Meta-learning
          2. Algorithm selection
          3. Clustering
          4. AutoML
          5. Algorithm ranking

          Qualifiers

          • Research-article
