CCFinder: using Spark to find clustering coefficient in big graphs

Authors:

Hassan Haghighi,

Saeed Shahrivari

The Journal of Supercomputing, Volume 73, Issue 11

Pages 4683 - 4710

https://doi.org/10.1007/s11227-017-2040-8

Published: 01 November 2017

Abstract

Networks with billions of vertices make it challenging to perform graph analysis in a reasonable time. The clustering coefficient is an important analytical measure of networks such as social and biological networks. Existing distributed algorithms for computing the clustering coefficient in big graphs are inefficient: they may fail because they demand too much memory, and even when they complete successfully, their execution time is not acceptable for real-world applications. We present a distributed MapReduce-based algorithm, called CCFinder, that efficiently computes the clustering coefficient in very big graphs. CCFinder runs on Apache Spark, a scalable data processing platform. It detects triangles efficiently using our proposed data structure, called FONL, which is cached in the distributed memory provided by Spark and reused multiple times. Because data items in the FONL are fine-grained and contain only the minimum required information, CCFinder requires less storage space and achieves better parallelism than its competitors. To find the clustering coefficient, our triangle counting solution is extended so that degree information of the vertices is available in the appropriate places. We performed several experiments on a Spark cluster with 60 processors. The results show that CCFinder achieves acceptable scalability and outperforms six existing competitor methods. Four of these competitors are based on graph processing systems, namely the GraphX, NScale, NScaleSpark, and Pregel frameworks; the other two, Cohen's method and NodeIterator++, are based on MapReduce.
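
To make the link between triangle counting and the clustering coefficient concrete, the following is a minimal single-machine sketch in Python. It is not CCFinder and does not reproduce the FONL structure, whose layout the abstract does not specify. It assumes the standard local clustering coefficient, 2·T(v) / (d(v)·(d(v) − 1)), where T(v) is the number of triangles through vertex v and d(v) is its degree. The function name `local_clustering` and the degree-based ordering used to find each triangle once are illustrative assumptions.

```python
from collections import defaultdict


def local_clustering(edges):
    """Return {vertex: local clustering coefficient} for an undirected graph.

    Illustrative sketch only: a common degree-ordering technique for
    triangle counting, not the CCFinder algorithm itself.
    """
    # Build undirected adjacency sets, ignoring self-loops and duplicate edges.
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    degree = {v: len(ns) for v, ns in adj.items()}

    # Total order on vertices: by degree, ties broken by vertex id.
    def rank(v):
        return (degree[v], v)

    # Keep only higher-ranked neighbours so that each triangle is
    # discovered exactly once, from its lowest-ranked vertex.
    higher = {v: {w for w in ns if rank(w) > rank(v)} for v, ns in adj.items()}

    triangles = defaultdict(int)
    for u, hu in higher.items():
        for v in hu:
            # Any w that is a higher-ranked neighbour of both u and v
            # closes the triangle (u, v, w).
            for w in hu & higher[v]:
                triangles[u] += 1
                triangles[v] += 1
                triangles[w] += 1

    # Degree information is needed alongside the triangle counts.
    return {
        v: (2.0 * triangles[v] / (d * (d - 1)) if d > 1 else 0.0)
        for v, d in degree.items()
    }


if __name__ == "__main__":
    # One triangle (1, 2, 3) plus a pendant edge (3, 4).
    print(local_clustering([(1, 2), (2, 3), (1, 3), (3, 4)]))
```

Keeping only higher-ranked neighbours shrinks each neighbour list and avoids counting a triangle more than once. The degree map is kept next to the triangle counts because the coefficient needs both, which mirrors the abstract's remark that degree information must be available where the counts are combined.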

Published In

The Journal of Supercomputing Volume 73, Issue 11

November 2017

469 pages

ISSN: 0920-8542

Copyright © 2017 Springer Science+Business Media, LLC.

Publisher

Kluwer Academic Publishers

United States
