
Automatic selection of clustering algorithms using supervised graph embedding

Published: 01 October 2021

Abstract

The widespread adoption of machine learning (ML) techniques and the extensive expertise required to apply them have led to increased interest in automated ML solutions that reduce the need for human intervention. One of the main challenges in applying ML to previously unseen problems is algorithm selection – the identification of high-performing algorithm(s) for a given dataset, task, and evaluation measure. This study addresses the algorithm selection challenge for data clustering, a fundamental task in data mining that is aimed at grouping similar objects. We present MARCO-GE, a novel meta-learning approach for the automated recommendation of clustering algorithms. MARCO-GE first transforms datasets into graphs and then utilizes a graph convolutional neural network technique to extract their latent representation. Using the embedding representations obtained, MARCO-GE trains a ranking meta-model capable of accurately recommending top-performing algorithms for a new dataset and clustering evaluation measure. An extensive evaluation on 210 datasets, 17 clustering algorithms, and 10 clustering measures demonstrates the effectiveness of our approach and its superiority in terms of predictive and generalization performance over state-of-the-art clustering meta-learning approaches.
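The abstract outlines a three-stage pipeline: each dataset is converted into a graph, a graph convolutional network produces a dataset-level embedding, and a ranking meta-model maps that embedding to a recommended ordering of clustering algorithms for a chosen evaluation measure. The Python sketch below illustrates only the shape of such a pipeline; the PCA-plus-k-NN graph construction, the single random-weight propagation step standing in for the trained GCN, the mean pooling, and the random-forest regressor used as a pointwise ranking meta-model are simplifying assumptions for illustration, not the MARCO-GE implementation.

```python
# Minimal sketch of a MARCO-GE-style recommendation pipeline (illustrative only).
# Assumptions not taken from the paper: PCA + k-NN graph construction, a single
# random-weight GCN-style propagation instead of a trained graph convolutional
# network, mean pooling, and a random-forest regressor as the ranking meta-model.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import kneighbors_graph


def dataset_to_embedding(X, k=10, dim=16, seed=0):
    """Map a tabular dataset to a fixed-size vector through a graph representation."""
    rng = np.random.default_rng(seed)
    Z = PCA(n_components=min(dim, X.shape[1])).fit_transform(X)
    A = kneighbors_graph(Z, n_neighbors=k, mode="connectivity").toarray()
    A = np.maximum(A, A.T) + np.eye(len(Z))          # symmetrise and add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    A_hat = d_inv_sqrt @ A @ d_inv_sqrt              # GCN-style normalised adjacency
    W = rng.normal(size=(Z.shape[1], dim))           # random projection (stand-in for learned GCN weights)
    H = np.tanh(A_hat @ Z @ W)                       # one propagation step -> node embeddings
    return H.mean(axis=0)                            # pool nodes into a dataset-level embedding


# Meta-training data: embeddings of previously seen datasets paired with the measured
# performance of each candidate clustering algorithm (scores here are placeholders).
rng = np.random.default_rng(0)
algorithms = ["kmeans", "spectral", "dbscan", "agglomerative"]
meta_X = np.stack([dataset_to_embedding(rng.normal(size=(200, 8)), seed=i) for i in range(30)])
meta_scores = rng.uniform(size=(30, len(algorithms)))

meta_model = RandomForestRegressor(n_estimators=200, random_state=0)
meta_model.fit(meta_X, meta_scores)

# Recommendation for a new, unseen dataset: predict per-algorithm scores and rank them.
new_embedding = dataset_to_embedding(rng.normal(size=(150, 8)), seed=99).reshape(1, -1)
predicted = meta_model.predict(new_embedding)[0]
print("Recommended order:", [algorithms[i] for i in np.argsort(predicted)[::-1]])
```

Pooling node embeddings into a single fixed-size vector is what allows datasets of different sizes to be handled by one meta-model; in the paper this role is played by a learned graph embedding rather than the random projection used in this sketch.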

Cited By

  • A Survey on AutoML Methods and Systems for Clustering. ACM Transactions on Knowledge Discovery from Data 18(5), 1–30 (2024). https://doi.org/10.1145/3643564
  • The Use of Dynamic n-Gram to Enhance TF-IDF Features Extraction for Bahasa Indonesia Cyberbullying Classification. Proceedings of the 2023 12th International Conference on Software and Computer Applications, 200–205 (2023). https://doi.org/10.1145/3587828.3587858
  • Graph-based fine-grained model selection for multi-source domain. Pattern Analysis & Applications 26(3), 1481–1492 (2023). https://doi.org/10.1007/s10044-023-01176-6
  • Learning dataset representation for automatic machine learning algorithm selection. Knowledge and Information Systems 64(10), 2599–2635 (2022). https://doi.org/10.1007/s10115-022-01716-2


Published In

           Information Sciences: an International Journal, Volume 577, Issue C
           October 2021, 902 pages

           Publisher

           Elsevier Science Inc., United States

          Author Tags

          1. Meta-learning
          2. Algorithm selection
          3. Clustering
          4. AutoML
          5. Algorithm ranking

          Qualifiers

          • Research-article
