poster

Approximate kernel k-means: solution to large scale kernel clustering

Authors:

Timothy C. Havens,

Anil K. Jain

KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 895 - 903

https://doi.org/10.1145/2020408.2020558

Published: 21 August 2011

Abstract

Digital data explosion mandates the development of scalable tools to organize the data in a meaningful and easily accessible form. Clustering is a commonly used tool for data organization. However, many clustering algorithms designed to handle large data sets assume linear separability of data and hence do not perform well on real world data sets. While kernel-based clustering algorithms can capture the non-linear structure in data, they do not scale well in terms of speed and memory requirements when the number of objects to be clustered exceeds tens of thousands. We propose an approximation scheme for kernel k-means, termed approximate kernel k-means, that reduces both the computational complexity and the memory requirements by employing a randomized approach. We show both analytically and empirically that the performance of approximate kernel k-means is similar to that of the kernel k-means algorithm, but with dramatically reduced run-time complexity and memory requirements.
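
The abstract does not spell out the randomized scheme, so the following is only a minimal sketch under stated assumptions: the randomization is taken to be uniform sampling of m points, and each cluster center is restricted to the span of those sampled points in feature space. The function names, the RBF kernel, the `ridge` term, and the `init_labels` parameter are illustrative choices, not details taken from the paper.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def approx_kernel_kmeans(X, n_clusters, n_samples, gamma=1.0, n_iter=50,
                         ridge=1e-8, init_labels=None, seed=0):
    """Kernel k-means with centers confined to the span of a random sample.

    Assumption (not stated in the abstract): centers are written as
    c_k = sum_j alpha[k, j] * phi(x_hat_j) over m uniformly sampled points,
    so only an n x m kernel block is ever formed, never the n x n matrix.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=n_samples, replace=False)
    K_B = rbf_kernel(X, X[idx], gamma)             # n x m block
    K_hat = K_B[idx]                               # m x m block among the samples
    K_reg = K_hat + ridge * np.eye(n_samples)      # small ridge for numerical stability

    if init_labels is None:
        labels = rng.integers(n_clusters, size=n)  # random starting assignment
    else:
        labels = np.asarray(init_labels, dtype=int).copy()

    for _ in range(n_iter):
        # Row-normalised membership matrix U (clusters x points).
        U = np.zeros((n_clusters, n))
        U[labels, np.arange(n)] = 1.0
        U /= np.maximum(U.sum(axis=1, keepdims=True), 1.0)

        # Least-squares coefficients of each center in the sampled span.
        alpha = np.linalg.solve(K_reg, (U @ K_B).T).T          # clusters x m

        # Squared distance to each center, dropping the constant k(x_i, x_i).
        center_norms = ((alpha @ K_hat) * alpha).sum(axis=1)   # ||c_k||^2
        dist = center_norms[None, :] - 2.0 * K_B @ alpha.T     # n x clusters

        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

Under these assumptions the dominant storage cost is the n x m block rather than the full n x n kernel matrix. For n = 100,000 points in double precision, the full matrix would need about 80 GB, while an n x m block with m = 1,000 needs about 0.8 GB.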

Index Terms

Computing methodologies → Machine learning → Learning paradigms → Unsupervised learning → Cluster analysis
Information systems → Information retrieval → Retrieval tasks and goals → Clustering and classification
Information systems → Information systems applications → Data mining → Clustering

Published In

KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining

August 2011

1446 pages

ISBN:9781450308137

DOI:10.1145/2020408

General Chair: Chid Apte (IBM Research)
Program Chairs: Joydeep Ghosh (UT Austin), Padhraic Smyth (UC Irvine)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Poster

Conference

KDD '11

KDD '11: The 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 21 - 24, 2011

San Diego, California, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
