DOI:10.1145/2020408.2020558

Approximate kernel k-means: solution to large scale kernel clustering

Published: 21 August 2011

Abstract

The explosion of digital data mandates the development of scalable tools to organize the data in a meaningful and easily accessible form. Clustering is a commonly used tool for data organization. However, many clustering algorithms designed to handle large data sets assume linear separability of data and hence do not perform well on real-world data sets. While kernel-based clustering algorithms can capture the non-linear structure in data, they do not scale well in terms of speed and memory requirements when the number of objects to be clustered exceeds tens of thousands. We propose an approximation scheme for kernel k-means, termed approximate kernel k-means, that reduces both the computational complexity and the memory requirements by employing a randomized approach. We show both analytically and empirically that the performance of approximate kernel k-means is similar to that of the kernel k-means algorithm, but with dramatically reduced run-time complexity and memory requirements.
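
The abstract does not spell out the randomized scheme, so the following is only a minimal sketch of one way such an approximation can work, not the authors' exact algorithm. It assumes the cluster centers are restricted to the span of m uniformly sampled points, so that only an n x m block of the kernel matrix is ever computed instead of the full n x n matrix. The RBF kernel, the `gamma` parameter, the random initialization, and the names `rbf_kernel` and `approx_kernel_kmeans` are all illustrative assumptions.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel between the rows of A and the rows of B."""
    sq_dist = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def approx_kernel_kmeans(X, k, m, gamma=1.0, n_iter=100, seed=0):
    """Kernel k-means with centers restricted to the span of m sampled points.

    Assumption (not stated in the abstract): uniform sampling without
    replacement and a random initial labeling.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # Randomized step: sample m points and compute only an (n, m) kernel block.
    sample = rng.choice(n, size=m, replace=False)
    K_B = rbf_kernel(X, X[sample], gamma)   # all points vs. sampled points
    K_hat = K_B[sample]                     # sampled points vs. themselves, (m, m)
    K_hat_pinv = np.linalg.pinv(K_hat)

    labels = rng.integers(0, k, size=n)
    for _ in range(n_iter):
        # Row-normalized membership matrix U of shape (k, n).
        U = np.zeros((k, n))
        U[labels, np.arange(n)] = 1.0
        sizes = U.sum(axis=1)
        U /= np.maximum(sizes, 1.0)[:, None]

        # Coefficients of each center in the basis of the sampled points.
        alpha = U @ K_B @ K_hat_pinv        # (k, m)

        # Squared distance to each center, dropping the constant K(x, x) term.
        quad = np.einsum("km,mj,kj->k", alpha, K_hat, alpha)
        dist = quad[None, :] - 2.0 * K_B @ alpha.T   # (n, k)
        dist[:, sizes == 0] = np.inf        # never assign to an empty cluster

        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

Calling `approx_kernel_kmeans(X, k=2, m=20, gamma=0.05)` on an `(n, d)` array returns one integer label per row. In this sketch the stored kernel block has n·m entries rather than n², which is where the memory saving comes from; as with ordinary k-means, the result depends on the initialization, so several seeds may be worth trying.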

Published In

KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2011
1446 pages
ISBN:9781450308137
DOI:10.1145/2020408

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. k-means
  2. kernel clustering
  3. large-scale clustering

Qualifiers

  • Poster

Conference

KDD '11

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
