Scalable discovery of best clusters on large graphs

Published: 01 September 2010

Abstract

The identification of clusters, well-connected components in a graph, is useful in many applications, from biological function prediction to social community detection. However, finding these clusters becomes difficult as graph sizes increase, because most current graph clustering algorithms scale poorly in time or memory. An important insight is that many clustering applications need only the subset of best clusters, not every cluster in the entire graph. In this paper we propose a new technique, Top Graph Clusters (TopGC), which probabilistically searches large, edge-weighted, directed graphs for their best clusters in linear time. The algorithm is inherently parallelizable and is able to find variable-size, overlapping clusters. To increase scalability, a parameter is introduced that controls memory use. When compared with three other state-of-the-art clustering techniques, TopGC achieves running time speedups of up to 70% on large-scale real-world datasets. In addition, the clusters returned by TopGC are consistently found to be better, both in calculated score and when compared on real-world benchmarks.
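
The abstract does not spell out how the probabilistic search works, so the following is only an illustrative sketch, not TopGC itself. It assumes one plausible approach: min-wise hashing of node neighborhoods, so that nodes with near-identical neighborhoods land in the same bucket after a single pass over the adjacency lists. The function names, the `adj` input format, the unweighted edges, and the density score are all hypothetical simplifications.

```python
import random
from collections import defaultdict

def minhash_candidates(adj, num_hashes=4, seed=0):
    """Group nodes whose closed neighborhoods share a min-hash signature.

    adj maps each node to the set of its out-neighbors (assumed input format;
    edge weights are ignored in this simplified sketch).
    """
    rng = random.Random(seed)
    nodes = sorted(set(adj) | {v for nbrs in adj.values() for v in nbrs})
    # One random permutation (stored as a rank table) per hash function.
    perms = []
    for _ in range(num_hashes):
        order = nodes[:]
        rng.shuffle(order)
        perms.append({n: i for i, n in enumerate(order)})
    buckets = defaultdict(set)
    for node, nbrs in adj.items():
        hood = set(nbrs) | {node}  # closed neighborhood
        # Signature: the lowest-ranked neighborhood member under each permutation.
        sig = tuple(min(hood, key=rank.__getitem__) for rank in perms)
        buckets[sig].add(node)
    # Only buckets holding more than one node are candidate clusters.
    return [group for group in buckets.values() if len(group) > 1]

def density(adj, group):
    """Fraction of possible directed edges inside the group that are present."""
    k = len(group)
    inside = sum(1 for u in group for v in adj.get(u, ()) if v in group and v != u)
    return inside / (k * (k - 1))

def top_clusters(adj, k=1, **kwargs):
    """Return the k candidate groups with the highest internal density."""
    candidates = minhash_candidates(adj, **kwargs)
    return sorted(candidates, key=lambda g: density(adj, g), reverse=True)[:k]
```

Calling `top_clusters(adj, k=5)` on an adjacency dictionary returns at most five candidate groups, ordered by density. Because each node is hashed independently, the bucketing loop could be split across workers, which is in the spirit of the parallelism the abstract mentions. Handling overlapping clusters and edge weights would need more than this sketch provides.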

Published In

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2
September 2010
1658 pages

Publisher

VLDB Endowment

Cited By

  • (2022) Community detection algorithms for recommendation systems: techniques and metrics. Computing, 105:2 (417-453). DOI: 10.1007/s00607-022-01131-z. Online publication date: 12-Nov-2022
  • (2021) Towards effective discovery of natural communities in complex networks and implications in e-commerce. Electronic Commerce Research, 21:4 (917-954). DOI: 10.1007/s10660-019-09395-y. Online publication date: 1-Dec-2021
  • (2021) I/O efficient k-truss community search in massive graphs. The VLDB Journal — The International Journal on Very Large Data Bases, 30:5 (713-738). DOI: 10.1007/s00778-020-00649-y. Online publication date: 22-Apr-2021
  • (2019) A survey of community search over big graphs. The VLDB Journal — The International Journal on Very Large Data Bases, 29:1 (353-392). DOI: 10.1007/s00778-019-00556-x. Online publication date: 20-Jul-2019
  • (2017) Mining Persistent and Discriminative Communities in Graph Ensembles. Proceedings of the 29th International Conference on Scientific and Statistical Database Management (1-6). DOI: 10.1145/3085504.3085532. Online publication date: 27-Jun-2017
  • (2017) A review of clustering techniques and developments. Neurocomputing, 267:C (664-681). DOI: 10.1016/j.neucom.2017.06.053. Online publication date: 6-Dec-2017
  • (2017) Faster compression methods for a weighted graph using locality sensitive hashing. Information Sciences: an International Journal, 421:C (237-253). DOI: 10.1016/j.ins.2017.07.033. Online publication date: 1-Dec-2017
  • (2017) Utilizing advances in correlation analysis for community structure detection. Expert Systems with Applications: An International Journal, 84:C (74-91). DOI: 10.1016/j.eswa.2017.05.010. Online publication date: 30-Oct-2017
  • (2017) Set-based unified approach for summarization of a multi-attributed graph. World Wide Web, 20:3 (543-570). DOI: 10.1007/s11280-016-0388-y. Online publication date: 1-May-2017
  • (2016) Are Evolutionary Computation-Based Methods Comparable to State-of-the-art non-Evolutionary Methods for Community Detection? Proceedings of the 2016 on Genetic and Evolutionary Computation Conference Companion (1465-1466). DOI: 10.1145/2908961.2931643. Online publication date: 20-Jul-2016