Research Article

A Sequential Sampling Framework for Spectral k-Means Based on Efficient Bootstrap Accuracy Estimations: Application to Distributed Clustering

Published: 01 July 2012

Abstract

The scalability of learning algorithms has always been a central concern for data mining researchers, and its importance has only grown with the rapid increase in data storage capacity and availability. To this end, sampling has been studied by several researchers in an effort to derive sufficiently accurate models from small fractions of the data. In this article we focus on spectral k-means, that is, the k-means approximation derived by spectral relaxation, and propose a sequential sampling framework that iteratively enlarges the sample size until the k-means results (objective function and cluster structure) become indistinguishable from the asymptotic (infinite-data) output. The proposed framework adopts a principle commonly applied in data mining research: making minimal assumptions about the data-generating distribution. This restriction imposes several challenges, mainly related to the efficiency of the sequential sampling procedure, which we address using elements of matrix perturbation theory and statistics. Moreover, although the main focus is on spectral k-means, we also demonstrate that the proposed framework can be generalized to handle spectral clustering.
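The framework's efficiency rests on bootstrap accuracy estimations backed by matrix perturbation theory; that machinery is beyond the scope of an abstract, but the generic sequential-sampling loop it drives can be sketched. The following is a minimal, illustrative Python sketch, not the authors' algorithm: the sample grows geometrically and stops once a naive bootstrap estimate of the variability of the k-means objective falls below a threshold. The helper names (`bootstrap_objective_spread`, `sequential_kmeans_sample`) and the parameters `eps`, `n0`, and `growth` are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def bootstrap_objective_spread(sample, k, n_boot=20, seed=0):
    """Relative spread of the per-point k-means objective across
    bootstrap resamples of the current sample (a crude accuracy proxy)."""
    rng = np.random.default_rng(seed)
    n = len(sample)
    objectives = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)               # resample with replacement
        km = KMeans(n_clusters=k, n_init=5, random_state=0).fit(sample[idx])
        objectives.append(km.inertia_ / n)             # normalize by sample size
    objectives = np.asarray(objectives)
    return objectives.std() / objectives.mean()

def sequential_kmeans_sample(data, k, eps=0.01, n0=200, growth=2.0, seed=0):
    """Grow the sample geometrically until the bootstrap spread of the
    objective drops below eps; return the model fitted on that sample."""
    rng = np.random.default_rng(seed)
    n = n0
    while True:
        size = min(n, len(data))
        sample = data[rng.choice(len(data), size=size, replace=False)]
        if bootstrap_objective_spread(sample, k) < eps or size == len(data):
            return KMeans(n_clusters=k, n_init=10, random_state=0).fit(sample), size
        n = int(n * growth)                            # enlarge and retry
```

Per the abstract, the paper's contribution is precisely to make the accuracy-estimation step efficient via matrix perturbation theory, rather than rerunning k-means on every bootstrap replicate as this sketch does, and to track the cluster structure in addition to the objective value.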
The proposed sequential sampling framework is subsequently employed to address the distributed clustering problem, where the task is to construct a global model for data that resides in distributed network nodes. The main challenge in this context stems from the bandwidth constraints that are commonly imposed, which require the distributed clustering algorithm to consume a minimal amount of network load. This setting illustrates the applicability of the proposed approach, as it enables the determination of a minimal sample size that suffices for constructing an accurate clustering model capturing the distributional characteristics of the data. In contrast to related distributed k-means approaches, our framework takes into account the fact that the choice of the number of clusters has a crucial effect on the required amount of communication. More precisely, the proposed algorithm derives a statistical estimate of the required relative sample sizes for all possible values of k. This unique feature of our distributed clustering framework enables a network administrator to choose an economical solution that identifies the crude cluster structure of a dataset without devoting excessive network resources to identifying all the “correct” detailed clusters.
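In the distributed setting described above, the same stopping rule can, in principle, govern how many points the network nodes ship to a coordinator. The sketch below is again a hedged illustration rather than the paper's protocol: it reuses `bootstrap_objective_spread` from the previous sketch, draws a uniform global sample with per-node shares proportional to local data sizes, and reports, for each candidate k, the fraction of the full data that had to be communicated. All function names and parameters here are hypothetical.

```python
import numpy as np

def draw_global_sample(node_data, n_total, rng):
    """Merge a uniform global sample: each node ships a share of points
    proportional to its local data size (the shipped points are the
    network load in this toy model)."""
    sizes = np.array([len(d) for d in node_data])
    shares = rng.multinomial(n_total, sizes / sizes.sum())
    parts = [d[rng.choice(len(d), size=min(s, len(d)), replace=False)]
             for d, s in zip(node_data, shares) if s > 0]
    return np.vstack(parts)

def required_sample_fractions(node_data, k_values, eps=0.01, n0=200):
    """For each candidate k, grow the merged sample until the bootstrap
    stopping rule fires, and report the data fraction communicated."""
    rng = np.random.default_rng(0)
    total = sum(len(d) for d in node_data)
    fractions = {}
    for k in k_values:
        n = n0
        while True:
            sample = draw_global_sample(node_data, min(n, total), rng)
            if bootstrap_objective_spread(sample, k) < eps or n >= total:
                fractions[k] = len(sample) / total     # bandwidth cost for this k
                break
            n *= 2                                     # double the request and retry
    return fractions
```

A network administrator could then inspect, say, `required_sample_fractions(nodes, range(2, 11))` and pick a coarser k whose estimated communication fraction fits the available bandwidth, in the spirit of the trade-off the abstract describes.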




Published In

ACM Transactions on Knowledge Discovery from Data, Volume 6, Issue 2
July 2012, 144 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/2297456

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 01 July 2012
      Accepted: 01 June 2011
      Revised: 01 June 2011
      Received: 01 January 2009
      Published in TKDD Volume 6, Issue 2


      Author Tags

      1. Asymptotic convergence
      2. bootstrapping
      3. clustering
      4. distributed clustering
      5. matrix perturbation theory
      6. sampling
      7. spectral



Cited By

• (2022) Distributed K-Clustering with Exponential Convergence. 2022 American Control Conference (ACC), 4256-4261. DOI: 10.23919/ACC53348.2022.9867551. Online publication date: 8-Jun-2022.
• (2018) Feature selection for k-means clustering stability. Data Mining and Knowledge Discovery 28:4, 918-960. DOI: 10.1007/s10618-013-0320-3. Online publication date: 26-Dec-2018.
• (2017) A context extraction and profiling engine for 5G network resource mapping. Computer Communications 109, 184-201. DOI: 10.1016/j.comcom.2017.06.003. Online publication date: Sep-2017.
• (2013) Towards More Efficient Image Web Search. Intelligent Information Management 5:6, 196-203. DOI: 10.4236/iim.2013.56022. Online publication date: 2013.
