research-article

Data weaving: scaling up the state-of-the-art in data clustering

Authors:

Martin ScholzAuthors Info & Claims

CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management

Pages 1083 - 1092

https://doi.org/10.1145/1458082.1458226

Published: 26 October 2008 Publication History

Abstract

The enormous amount and dimensionality of data processed by modern data mining tools require effective, scalable unsupervised learning techniques. Unfortunately, the majority of previously proposed clustering algorithms are either effective or scalable. This paper is concerned with information-theoretic clustering (ITC) that has historically been considered the state-of-the-art in clustering multi-dimensional data. Most existing ITC methods are computationally expensive and not easily scalable. Those few ITC methods that scale well (using, e.g., parallelization) are often outperformed by the others, of an inherently sequential nature. First, we justify this observation theoretically. We then propose data weaving - a novel method for parallelizing sequential clustering algorithms. Data weaving is intrinsically multi-modal - it allows simultaneous clustering of a few types of data (modalities). Finally, we use data weaving to parallelize multi-modal ITC, which results in proposing a powerful DataLoom algorithm. In our experimentation with small datasets, DataLoom shows practically identical performance compared to expensive sequential alternatives. On large datasets, however, DataLoom demonstrates significant gains over other parallel clustering methods. To illustrate the scalability, we simultaneously clustered rows and columns of a contingency table with over 120 billion entries.

References

[1]

R. Bekkerman, R. El-Yaniv, and A. McCallum. Multi-way distributional clustering via pairwise interactions. In Proceedings of ICML-22, pages 41--48, 2005.

Digital Library

[2]

R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter. On feature distributional clustering for text categorization. In Proceedings of SIGIR, pages 146--153, 2001.

Digital Library

[3]

R. Bekkerman, M. Sahami, and E. Learned-Miller. Combinatorial Markov Random Fields. In Proceedings of ECML-17, 2006.

Digital Library

[4]

J. Besag. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, 48(3), 1986.

[5]

D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993--1022, 2003.

Digital Library

[6]

C. Böhm, C. Faloutsos, J.-Y. Pan, and C. Plant. Robust information-theoretic clustering. In Proceedings of ACM SIGKDD, pages 65--75, 2006.

Digital Library

[7]

S. Brecheisen, H.-P. Kriegel, and M. Pfeifle. Parallel density-based clustering of complex objects. In Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2006.

Digital Library

[8]

C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In Advances in Neural Information Processing Systems (NIPS), 2006.

[9]

K. Crammer, P. Talukdar, and F. Pereira. A rate-distortion one-class model and its applications to clustering. In Proceedings of the 25st International Conference on Machine Learning, 2008.

Digital Library

[10]

J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. Symposium on Operating System Design and Implementation (OSDI), pages 137--150, 2004.

Digital Library

[11]

D. Deb and R. A. Angryk. Distributed document clustering using word-clusters. In IEEE Symposium on Computational Intelligence and Data Mining (CIDM), pages 376--383, 2007.

[12]

I. S. Dhillon, S. Mallela, and D. S. Modha. Information-theoretic co-clustering. In Proceedings of SIGKDD-9, pages 89--98, 2003.

Digital Library

[13]

I. S. Dhillon and D. S. Modha. A data clustering algorithm on distributed memory multiprocessors. In Large-Scale Parallel Data Mining, volume 1759 of Lecture Notes in Artificial Intelligence, 2000.

Digital Library

[14]

R. El-Yaniv and O. Souroujon. Iterative double clustering for unsupervised and semi-supervised learning. In Advances in Neural Information Processing Systems (NIPS-14), 2001.

[15]

G. Forman and B. Zhang. Distributed data clustering can be efficient and exact. SIGKDD Exploration Newsletter, 2(2):34--38, 2000.

Digital Library

[16]

N. Friedman, O. Mosenzon, N. Slonim, and N. Tishby. Multivariate information bottleneck. In Proceedings of UAI-17, 2001.

Digital Library

[17]

B. Gao, T.-Y. Liu, X. Zheng, Q.-S. Cheng, and W.-Y. Ma. Consistent bipartite graph co-partitioning for star-structured high-order heterogeneous data co-clustering. In Proceedings of ACM SIGKDD, 2005.

Digital Library

[18]

P. E. Hadjidoukas and L. Amsaleg. Parallelization of a hierarchical data clustering algorithm using openmp. In Proceedings of the International Workshop on OpenMP (IWOMP), Reims, France, June 2006.

[19]

M. Johnson, R. H. Liao, A. Rasmussen, R. Sridharan, D. Garcia, and B. Harvey. Infusing parallelism into introductory computer science using mapreduce. In Proceedings of SIGCSE: Symposium on Computer Science Education, 2008.

[20]

D. Judd, P. K. McKinley, and A. K. Jain. Large-scale parallel data clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):871--876, 1998.

Digital Library

[21]

S. L. Lauritzen. Graphical Models. Clarendon Press, 1996.

[22]

D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. Rcv1: A new benchmark collection for text categorization research. JMLR, 5:361--397, 2004.

Digital Library

[23]

K. Liu, H. Kargupta, J. Ryan, and K. Bhaduri. Distributed data mining bibliography. http://www.cs.umbc.edu/~hillol/DDMBIB, 2004.

[24]

B. Long, Z. Zhang, and P. S. Yu. A probabilistic framework for relational clustering. In Proceedings of the ACM SIGKDD, pages 470--479, 2007.

Digital Library

[25]

A. McCallum, A. Corrada-Emmanuel, and X. Wang. Topic and role discovery in social networks. In Proceedings of IJCAI-19, pages 786--791, 2005.

Digital Library

[26]

A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of ACM SIGKDD, pages 169--178, 2000.

Digital Library

[27]

R. Pensa and J.-F. Boulicaut. Constrained co-clustering of gene expression data. In Proceedings of the 2008 SIAM International Conference on Data Mining, pages 25--36, 2008.

[28]

R. Rocci and M. Vichi. Two-mode multi-partitioning. Computational Statistics and Data Analysis, 52(4), 2008.

Digital Library

[29]

N. Rooney, D. Patterson, M. Galushka, and V. Dobrynin. A scaleable document clustering approach for large document corpora. Information Processing and Management, 42(5):1163--1175, 2006.

Digital Library

[30]

N. Slonim, N. Friedman, and N. Tishby. Unsupervised document classification using sequential information maximization. In Proceedings of SIGIR-25, 2002.

Digital Library

[31]

N. Slonim and N. Tishby. Agglomerative information bottleneck. In Advances in Neural Information Processing Systems 12 (NIPS), pages 617--623, 2000.

[32]

M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra. MPI - The Complete Reference: Volume 1, The MPI Core. MIT Press, 2nd edition, 1998.

Digital Library

[33]

C. Sutton and A. McCallum. Piecewise training of undirected models. In Proceedings of UAI-21, 2005.

[34]

N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method, 1999. Invited paper to the 37th Annual Allerton Conference on Communication, Control, and Computing.

[35]

X. Xu, J. Jäger, and H.-P. Kriegel. A fast parallel clustering algorithm for large spatial databases. Data Mining and Knowledge Discovery, 3(3):263--290, 1999.

Digital Library

Cited By

Huang QLuo PTung A(2023)A New Sparse Data Clustering Method Based On Frequent ItemsProceedings of the ACM on Management of Data10.1145/35886851:1(1-28)Online publication date: 30-May-2023
https://doi.org/10.1145/3588685
Qian QJin RYi JZhang LZhu S(2015)Efficient distance metric learning by adaptive sampling and mini-batch stochastic gradient descent (SGD)Machine Language10.1007/s10994-014-5456-x99:3(353-372)Online publication date: 1-Jun-2015
https://dl.acm.org/doi/10.1007/s10994-014-5456-x
Miller LGazan RStill SFussell SLutters WMorris MReddy M(2014)Unsupervised classification and visualization of unstructured text for the support of interdisciplinary collaborationProceedings of the 17th ACM conference on Computer supported cooperative work & social computing10.1145/2531602.2531666(1033-1042)Online publication date: 15-Feb-2014
https://dl.acm.org/doi/10.1145/2531602.2531666
Show More Cited By

Index Terms

Data weaving: scaling up the state-of-the-art in data clustering
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
  2. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Parallel programming languages

Recommendations

Towards information-theoretic K-means clustering for image indexing

Information-theoretic K-means (Info-Kmeans) aims to cluster high-dimensional data, such as images featured by the bag-of-features (BOF) model, using K-means algorithm with KL-divergence as the distance. While research efforts along this line have shown ...
SAIL: summation-based incremental learning for information-theoretic clustering
KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining

Information-theoretic clustering aims to exploit information theoretic measures as the clustering criteria. A common practice on this topic is so-called INFO-K-means, which performs K-means clustering with the KL-divergence as the proximity function. ...
Gromov-Wasserstein Multi-modal Alignment and Clustering
CIKM '22: Proceedings of the 31st ACM International Conference on Information & Knowledge Management

Multi-modal clustering aims at finding a clustering structure shared by the data of different modalities in an unsupervised way. Currently, solving this problem often relies on two assumptions: i) the multi-modal data own the same latent distribution, ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management

October 2008

1562 pages

ISBN:9781595939913

DOI:10.1145/1458082

General Chair:
James G. Shanahan
Church and Duncan Group Inc, USA
,
Program Chairs:
Sihem Amer-Yahia
Yahoo! Research, USA
,
Ioana Manolescu
INRIA, France
,
Yi Zhang
University of California, Santa Cruz, USA
,
David A. Evans
JustSystems Evans Research, USA
,
Alek Kolcz
Microsoft Live Labs, USA
,
Key-Sun Choi
KAIST, Korea
,
Abdur Chowdury
Twitter, USA

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 October 2008

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

CIKM08

Sponsor:

CIKM08: Conference on Information and Knowledge Management

October 26 - 30, 2008

California, Napa Valley, USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Upcoming Conference

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

7
Total Citations
View Citations
505
Total Downloads

Downloads (Last 12 months)2
Downloads (Last 6 weeks)0

Reflects downloads up to 30 Aug 2024

Other Metrics

View Author Metrics

Citations

Cited By

Huang QLuo PTung A(2023)A New Sparse Data Clustering Method Based On Frequent ItemsProceedings of the ACM on Management of Data10.1145/35886851:1(1-28)Online publication date: 30-May-2023
https://doi.org/10.1145/3588685
Qian QJin RYi JZhang LZhu S(2015)Efficient distance metric learning by adaptive sampling and mini-batch stochastic gradient descent (SGD)Machine Language10.1007/s10994-014-5456-x99:3(353-372)Online publication date: 1-Jun-2015
https://dl.acm.org/doi/10.1007/s10994-014-5456-x
Miller LGazan RStill SFussell SLutters WMorris MReddy M(2014)Unsupervised classification and visualization of unstructured text for the support of interdisciplinary collaborationProceedings of the 17th ACM conference on Computer supported cooperative work & social computing10.1145/2531602.2531666(1033-1042)Online publication date: 15-Feb-2014
https://dl.acm.org/doi/10.1145/2531602.2531666
Chen XBai BQi YLin QCarbonell J(2010)Learning Preferences with Millions of Parameters by Enforcing SparsityProceedings of the 2010 IEEE International Conference on Data Mining10.1109/ICDM.2010.67(779-784)Online publication date: 13-Dec-2010
https://dl.acm.org/doi/10.1109/ICDM.2010.67
Folino FGreco GGuzzo APontieri L(2010)Scalable parallel co-clustering over multiple heterogeneous data types2010 International Conference on High Performance Computing & Simulation10.1109/HPCS.2010.5547087(529-535)Online publication date: Jun-2010
https://doi.org/10.1109/HPCS.2010.5547087
Wang LChen PHuang LCheung DSong IChu WHu XLin J(2009)An efficient clustering algorithm for large-scale topical web pagesProceedings of the 18th ACM conference on Information and knowledge management10.1145/1645953.1646247(1851-1854)Online publication date: 2-Nov-2009
https://dl.acm.org/doi/10.1145/1645953.1646247
Bekkerman RScholz MViswanathan KElder JFogelman FFlach PZaki M(2009)Improving clustering stability with combinatorial MRFsProceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining10.1145/1557019.1557037(99-108)Online publication date: 28-Jun-2009
https://dl.acm.org/doi/10.1145/1557019.1557037

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents