DOI: 10.1145/1458082.1458226

Data weaving: scaling up the state-of-the-art in data clustering

Published: 26 October 2008

Abstract

The enormous amount and dimensionality of data processed by modern data mining tools require effective, scalable unsupervised learning techniques. Unfortunately, the majority of previously proposed clustering algorithms are either effective or scalable, but not both. This paper is concerned with information-theoretic clustering (ITC), which has historically been considered the state of the art in clustering multi-dimensional data. Most existing ITC methods are computationally expensive and not easily scalable. The few ITC methods that scale well (using, e.g., parallelization) are often outperformed by the others, which are inherently sequential. First, we justify this observation theoretically. We then propose data weaving - a novel method for parallelizing sequential clustering algorithms. Data weaving is intrinsically multi-modal - it allows simultaneous clustering of a few types of data (modalities). Finally, we use data weaving to parallelize multi-modal ITC, which results in a powerful algorithm, DataLoom. In our experiments with small datasets, DataLoom performs practically identically to expensive sequential alternatives. On large datasets, however, DataLoom demonstrates significant gains over other parallel clustering methods. To illustrate its scalability, we simultaneously clustered rows and columns of a contingency table with over 120 billion entries.
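
To make the setting concrete, the following is a minimal sketch of the kind of objective that information-theoretic co-clustering typically scores: the mutual information between row clusters and column clusters of a contingency table. It is an illustration under stated assumptions, not the DataLoom algorithm or data weaving itself. The function name `clustered_mutual_information`, the toy table, and the label assignments are all hypothetical.

```python
import math


def clustered_mutual_information(table, row_labels, col_labels):
    """Mutual information (in bits) between row clusters and column clusters.

    table      -- contingency table as a list of rows of non-negative counts
    row_labels -- cluster index (0-based) for each row
    col_labels -- cluster index (0-based) for each column
    """
    n_row_clusters = max(row_labels) + 1
    n_col_clusters = max(col_labels) + 1

    # Aggregate the raw counts into one cell per (row cluster, column cluster).
    agg = [[0.0] * n_col_clusters for _ in range(n_row_clusters)]
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            agg[row_labels[i]][col_labels[j]] += count

    total = sum(sum(r) for r in agg)
    row_marginal = [sum(r) / total for r in agg]
    col_marginal = [
        sum(agg[a][b] for a in range(n_row_clusters)) / total
        for b in range(n_col_clusters)
    ]

    # I = sum over cells of p(a, b) * log2( p(a, b) / (p(a) * p(b)) ).
    mi = 0.0
    for a in range(n_row_clusters):
        for b in range(n_col_clusters):
            p = agg[a][b] / total
            if p > 0:
                mi += p * math.log2(p / (row_marginal[a] * col_marginal[b]))
    return mi


# Hypothetical toy table with an obvious two-block structure.
table = [
    [5, 5, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 5, 5],
    [0, 0, 5, 5],
]

# Clustering that follows the blocks versus one that cuts across them.
aligned = clustered_mutual_information(table, [0, 0, 1, 1], [0, 0, 1, 1])
crossed = clustered_mutual_information(table, [0, 1, 0, 1], [0, 1, 0, 1])
```

In this toy case the block-aligned clustering preserves more information between the two cluster variables than the crossing one, which is the sense in which a clustering can be scored information-theoretically. A sequential method would typically improve such a score one element at a time, which is what makes parallelization non-trivial.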

Published In

CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management
October 2008
1562 pages
ISBN:9781595939913
DOI:10.1145/1458082
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. information-theoretic clustering
  2. multi-modal clustering
  3. parallel and distributed data mining

Qualifiers

  • Research-article

Conference

CIKM '08: Conference on Information and Knowledge Management
October 26 - 30, 2008
Napa Valley, California, USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%


Cited By

  • (2023) A New Sparse Data Clustering Method Based On Frequent Items. Proceedings of the ACM on Management of Data 1:1 (1-28). DOI: 10.1145/3588685. Online publication date: 30-May-2023
  • (2015) Efficient distance metric learning by adaptive sampling and mini-batch stochastic gradient descent (SGD). Machine Learning 99:3 (353-372). DOI: 10.1007/s10994-014-5456-x. Online publication date: 1-Jun-2015
  • (2014) Unsupervised classification and visualization of unstructured text for the support of interdisciplinary collaboration. Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing (1033-1042). DOI: 10.1145/2531602.2531666. Online publication date: 15-Feb-2014
  • (2010) Learning Preferences with Millions of Parameters by Enforcing Sparsity. Proceedings of the 2010 IEEE International Conference on Data Mining (779-784). DOI: 10.1109/ICDM.2010.67. Online publication date: 13-Dec-2010
  • (2010) Scalable parallel co-clustering over multiple heterogeneous data types. 2010 International Conference on High Performance Computing & Simulation (529-535). DOI: 10.1109/HPCS.2010.5547087. Online publication date: Jun-2010
  • (2009) An efficient clustering algorithm for large-scale topical web pages. Proceedings of the 18th ACM conference on Information and knowledge management (1851-1854). DOI: 10.1145/1645953.1646247. Online publication date: 2-Nov-2009
  • (2009) Improving clustering stability with combinatorial MRFs. Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (99-108). DOI: 10.1145/1557019.1557037. Online publication date: 28-Jun-2009
