research-article

Scalable discovery of best clusters on large graphs

Editors: Elisa Bertino, Paolo Atzeni, Kian Lee Tan, Yi Chen, Y. C. Tay

Authors: Kathy Macropol, Ambuj Singh

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2

Pages 693 - 702

https://doi.org/10.14778/1920841.1920930

Published: 01 September 2010

Abstract

Identifying clusters, the well-connected components of a graph, is useful in many applications, from biological function prediction to social community detection. However, finding these clusters becomes difficult as graph sizes increase, and most current graph clustering algorithms scale poorly in time or memory. An important insight is that many clustering applications need only the subset of best clusters, not every cluster in the graph. In this paper we propose a new technique, Top Graph Clusters (TopGC), which probabilistically searches large, edge-weighted, directed graphs for their best clusters in linear time. The algorithm is inherently parallelizable and is able to find variable-size, overlapping clusters. To increase scalability, a parameter is introduced that controls memory use. When compared with three other state-of-the-art clustering techniques, TopGC achieves running time speedups of up to 70% on large-scale real-world datasets. In addition, the clusters returned by TopGC are consistently found to be better, both in calculated score and when compared on real-world benchmarks.

Published In

Proceedings of the VLDB Endowment Volume 3, Issue 1-2

September 2010

1658 pages

ISSN: 2150-8097

Publisher

VLDB Endowment
