research-article

Big graph mining: algorithms and discoveries

Authors:

Christos Faloutsos

ACM SIGKDD Explorations Newsletter, Volume 14, Issue 2

Pages 29 - 36

https://doi.org/10.1145/2481244.2481249

Published: 30 April 2013

Abstract

How do we find patterns and anomalies in very large graphs with billions of nodes and edges? How can we mine such big graphs efficiently? Big graphs are everywhere, from social networks and mobile call networks to biological networks and the World Wide Web. Mining them enables many interesting applications, including cyber security, fraud detection, Web search, and recommendation.

In this paper we describe Pegasus, a big graph mining system built on top of MapReduce, a modern distributed data processing platform. We introduce GIM-V, an important primitive that Pegasus uses in its algorithms for analyzing the structure of large graphs. We also introduce HEigen, a large-scale eigensolver that is also part of Pegasus. Both GIM-V and HEigen are highly optimized: they scale linearly with the number of machines and edges, and run 9.2x and 76x faster than their naive counterparts, respectively.
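To make the GIM-V primitive more concrete, the following is a minimal single-machine sketch, not the Pegasus implementation. It assumes the formulation in the earlier Pegasus paper (Kang, Tsourakakis, and Faloutsos, ICDM 2009), where a generalized matrix-vector product is built from three user-supplied operations: combine2, combineAll, and assign. The function names `gim_v` and `connected_components`, the edge-list layout, and the toy graph are illustrative assumptions, and nothing here uses MapReduce.

```python
from typing import Callable, List, Optional, Tuple

Edge = Tuple[int, int, float]  # (row i, column j, matrix value m_ij)


def gim_v(
    edges: List[Edge],
    v: List[int],
    combine2: Callable[[float, int], int],
    combine_all: Callable[[List[int]], Optional[int]],
    assign: Callable[[int, Optional[int]], int],
) -> List[int]:
    """One generalized matrix-vector step:
    v'_i = assign(v_i, combine_all({combine2(m_ij, v_j) for every stored m_ij}))."""
    partial: List[List[int]] = [[] for _ in v]
    for i, j, m_ij in edges:
        partial[i].append(combine2(m_ij, v[j]))
    return [assign(v[i], combine_all(partial[i])) for i in range(len(v))]


def connected_components(num_nodes: int, undirected_edges: List[Tuple[int, int]]) -> List[int]:
    """Label every node with the smallest node id in its component by
    iterating gim_v until the label vector stops changing."""
    # Store each undirected edge in both directions (symmetric matrix).
    edges: List[Edge] = []
    for a, b in undirected_edges:
        edges.append((a, b, 1.0))
        edges.append((b, a, 1.0))

    labels = list(range(num_nodes))
    while True:
        new_labels = gim_v(
            edges,
            labels,
            combine2=lambda m_ij, v_j: v_j,                    # pass the neighbor's label
            combine_all=lambda xs: min(xs) if xs else None,    # smallest neighbor label
            assign=lambda v_i, agg: v_i if agg is None else min(v_i, agg),
        )
        if new_labels == labels:
            return labels
        labels = new_labels


if __name__ == "__main__":
    # Toy graph: components {0, 1, 2}, {3, 4}, and the isolated node 5.
    print(connected_components(6, [(0, 1), (1, 2), (3, 4)]))
```

Swapping in different operations yields other algorithms from the same loop; for example, multiplication for combine2 and summation for combineAll recover an ordinary matrix-vector product, which is the core step of PageRank-style iterations.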

Using Pegasus, we analyze very large, real-world graphs with billions of nodes and edges. Our findings include anomalous spikes in the connected component size distribution, the 7 degrees of separation in a Web graph, and anomalous adult advertisers in the who-follows-whom Twitter social network.
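For readers unfamiliar with the term, a connected component size distribution records how many components there are of each size. The sketch below is a small single-machine illustration using union-find; it is an assumption-laden stand-in for illustration only and does not reflect how Pegasus computes components at scale. The function name `component_size_distribution` and the toy graph are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def component_size_distribution(num_nodes: int, edges: List[Tuple[int, int]]) -> Dict[int, int]:
    """Return a mapping {component size: number of components of that size}."""
    parent = list(range(num_nodes))

    def find(x: int) -> int:
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the two components joined by each edge.
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Count nodes per root, then count how often each size occurs.
    nodes_per_root = Counter(find(x) for x in range(num_nodes))
    return dict(Counter(nodes_per_root.values()))


if __name__ == "__main__":
    # Toy graph: one component of size 3 and two components of size 2.
    print(component_size_distribution(7, [(0, 1), (1, 2), (3, 4), (5, 6)]))
```

On a real graph, a spike would show up as a size whose count is far higher than the counts of neighboring sizes.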



Index Terms

Big graph mining: algorithms and discoveries
1. Information systems



Published In

ACM SIGKDD Explorations Newsletter Volume 14, Issue 2

December 2012

81 pages

ISSN:1931-0145

EISSN:1931-0153

DOI:10.1145/2481244

Copyright © 2013 Authors.

Publisher

Association for Computing Machinery

New York, NY, United States


