Research Article

A Sequential Sampling Framework for Spectral k-Means Based on Efficient Bootstrap Accuracy Estimations: Application to Distributed Clustering

Published: 01 July 2012

Abstract

The scalability of learning algorithms has always been a central concern for data mining researchers, and its importance has only grown with the rapid increase in data storage capacity and availability. To this end, sampling has been studied by several researchers in an effort to derive sufficiently accurate models from small fractions of the data. In this article we focus on spectral k-means, that is, the k-means approximation derived by spectral relaxation, and propose a sequential sampling framework that iteratively enlarges the sample size until the k-means results (objective function and cluster structure) become indistinguishable from the asymptotic (infinite-data) output. The proposed framework adopts a principle commonly applied in data mining research: making minimal assumptions about the data-generating distribution. This restriction imposes several challenges, mainly related to the efficiency of the sequential sampling procedure, which we address using elements of matrix perturbation theory and statistics. Moreover, although the main focus is on spectral k-means, we also demonstrate that the proposed framework can be generalized to handle spectral clustering.
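The framework's efficiency rests on bootstrap accuracy estimations backed by matrix perturbation theory; that machinery is beyond the scope of an abstract, but the generic sequential-sampling loop it drives can be sketched. The following is a minimal, illustrative Python sketch, not the authors' algorithm: the sample grows geometrically and stops once a naive bootstrap estimate of the variability of the k-means objective falls below a threshold. The helper names (`bootstrap_objective_spread`, `sequential_kmeans_sample`) and the parameters `eps`, `n0`, and `growth` are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def bootstrap_objective_spread(sample, k, n_boot=20, seed=0):
    """Relative spread of the per-point k-means objective across
    bootstrap resamples of the current sample (a crude accuracy proxy)."""
    rng = np.random.default_rng(seed)
    n = len(sample)
    objectives = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)               # resample with replacement
        km = KMeans(n_clusters=k, n_init=5, random_state=0).fit(sample[idx])
        objectives.append(km.inertia_ / n)             # normalize by sample size
    objectives = np.asarray(objectives)
    return objectives.std() / objectives.mean()

def sequential_kmeans_sample(data, k, eps=0.01, n0=200, growth=2.0, seed=0):
    """Grow the sample geometrically until the bootstrap spread of the
    objective drops below eps; return the model fitted on that sample."""
    rng = np.random.default_rng(seed)
    n = n0
    while True:
        size = min(n, len(data))
        sample = data[rng.choice(len(data), size=size, replace=False)]
        if bootstrap_objective_spread(sample, k) < eps or size == len(data):
            return KMeans(n_clusters=k, n_init=10, random_state=0).fit(sample), size
        n = int(n * growth)                            # enlarge and retry
```

Per the abstract, the paper's contribution is precisely to make the accuracy-estimation step efficient via matrix perturbation theory, rather than rerunning k-means on every bootstrap replicate as this sketch does, and to track the cluster structure in addition to the objective value.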
The proposed sequential sampling framework is subsequently employed to address the distributed clustering problem, where the task is to construct a global model for data that resides in distributed network nodes. The main challenge in this context stems from the bandwidth constraints that are commonly imposed, which require the distributed clustering algorithm to consume a minimal amount of network load. This setting illustrates the applicability of the proposed approach, as it enables the determination of a minimal sample size that suffices for constructing an accurate clustering model capturing the distributional characteristics of the data. In contrast to related distributed k-means approaches, our framework takes into account the fact that the choice of the number of clusters has a crucial effect on the required amount of communication. More precisely, the proposed algorithm derives a statistical estimate of the required relative sample sizes for all possible values of k. This unique feature of our distributed clustering framework enables a network administrator to choose an economical solution that identifies the crude cluster structure of a dataset without devoting excessive network resources to identifying all the “correct” detailed clusters.
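In the distributed setting described above, the same stopping rule can, in principle, govern how many points the network nodes ship to a coordinator. The sketch below is again a hedged illustration rather than the paper's protocol: it reuses `bootstrap_objective_spread` from the previous sketch, draws a uniform global sample with per-node shares proportional to local data sizes, and reports, for each candidate k, the fraction of the full data that had to be communicated. All function names and parameters here are hypothetical.

```python
import numpy as np

def draw_global_sample(node_data, n_total, rng):
    """Merge a uniform global sample: each node ships a share of points
    proportional to its local data size (the shipped points are the
    network load in this toy model)."""
    sizes = np.array([len(d) for d in node_data])
    shares = rng.multinomial(n_total, sizes / sizes.sum())
    parts = [d[rng.choice(len(d), size=min(s, len(d)), replace=False)]
             for d, s in zip(node_data, shares) if s > 0]
    return np.vstack(parts)

def required_sample_fractions(node_data, k_values, eps=0.01, n0=200):
    """For each candidate k, grow the merged sample until the bootstrap
    stopping rule fires, and report the data fraction communicated."""
    rng = np.random.default_rng(0)
    total = sum(len(d) for d in node_data)
    fractions = {}
    for k in k_values:
        n = n0
        while True:
            sample = draw_global_sample(node_data, min(n, total), rng)
            if bootstrap_objective_spread(sample, k) < eps or n >= total:
                fractions[k] = len(sample) / total     # bandwidth cost for this k
                break
            n *= 2                                     # double the request and retry
    return fractions
```

A network administrator could then inspect, say, `required_sample_fractions(nodes, range(2, 11))` and pick a coarser k whose estimated communication fraction fits the available bandwidth, in the spirit of the trade-off the abstract describes.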




Published In

ACM Transactions on Knowledge Discovery from Data, Volume 6, Issue 2
July 2012, 144 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/2297456

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 01 July 2012
      Accepted: 01 June 2011
      Revised: 01 June 2011
      Received: 01 January 2009
      Published in TKDD Volume 6, Issue 2


      Author Tags

      1. Asymptotic convergence
      2. bootstrapping
      3. clustering
      4. distributed clustering
      5. matrix perturbation theory
      6. sampling
      7. spectral



Cited By

• (2022) Distributed K-Clustering with Exponential Convergence. 2022 American Control Conference (ACC), 4256-4261. DOI: 10.23919/ACC53348.2022.9867551. Online publication date: 8-Jun-2022.
• (2018) Feature selection for k-means clustering stability. Data Mining and Knowledge Discovery 28:4, 918-960. DOI: 10.1007/s10618-013-0320-3. Online publication date: 26-Dec-2018.
• (2017) A context extraction and profiling engine for 5G network resource mapping. Computer Communications 109, 184-201. DOI: 10.1016/j.comcom.2017.06.003. Online publication date: Sep-2017.
• (2013) Towards More Efficient Image Web Search. Intelligent Information Management 5:6, 196-203. DOI: 10.4236/iim.2013.56022. Online publication date: 2013.
