research-article

Big graph mining: algorithms and discoveries

Published: 30 April 2013

Abstract

How do we find patterns and anomalies in very large graphs with billions of nodes and edges? How can we mine such big graphs efficiently? Big graphs are everywhere, ranging from social networks and mobile call networks to biological networks and the World Wide Web. Mining big graphs enables many interesting applications, including cyber security, fraud detection, Web search, and recommendation.
In this paper, we describe Pegasus, a big graph mining system built on top of MapReduce, a modern distributed data processing platform. We introduce GIM-V, an important primitive that Pegasus's algorithms use to analyze the structure of large graphs. We also introduce HEigen, a large-scale eigensolver that is also part of Pegasus. Both GIM-V and HEigen are highly optimized: they achieve linear scale-up in the number of machines and edges, and run 9.2x and 76x faster than their naive counterparts, respectively.
Using Pegasus, we analyze very large, real-world graphs with billions of nodes and edges. Our findings include anomalous spikes in the connected component size distribution, the seven degrees of separation in a Web graph, and anomalous adult advertisers in the who-follows-whom Twitter social network.
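
The abstract names GIM-V without defining it. In the original Pegasus paper, GIM-V stands for generalized iterated matrix-vector multiplication: a matrix-vector product whose multiply, sum, and assign steps are replaced by three user-supplied operations, usually called combine2, combineAll, and assign. The following is a minimal single-machine sketch of that idea, applied to connected components. It is an illustration only, not the Pegasus MapReduce implementation; the function names `gim_v_step` and `connected_components` and the edge-list representation are assumptions made for this example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# One nonzero matrix entry: (row i, column j, value m_ij).
Edge = Tuple[int, int, int]


def gim_v_step(
    edges: Sequence[Edge],
    v: List[int],
    combine2: Callable[[int, int], int],
    combine_all: Callable[[List[int]], int],
    assign: Callable[[int, int], int],
) -> List[int]:
    """One generalized matrix-vector multiplication (hypothetical helper)."""
    partial: Dict[int, List[int]] = {}
    for i, j, m_ij in edges:
        # combine2 plays the role of the scalar product m_ij * v_j.
        partial.setdefault(i, []).append(combine2(m_ij, v[j]))
    new_v = []
    for i, old in enumerate(v):
        if i in partial:
            # combine_all plays the role of the row sum; assign merges old and new.
            new_v.append(assign(old, combine_all(partial[i])))
        else:
            new_v.append(old)  # Isolated node: nothing to combine.
    return new_v


def connected_components(n: int, undirected_edges: Sequence[Tuple[int, int]]) -> List[int]:
    """Label each node with the smallest node id in its component."""
    edges: List[Edge] = []
    for a, b in undirected_edges:
        edges.append((a, b, 1))
        edges.append((b, a, 1))
    v = list(range(n))  # Every node starts as its own component.
    while True:
        new_v = gim_v_step(
            edges,
            v,
            combine2=lambda m_ij, v_j: m_ij * v_j,  # m_ij is 1, so this is v_j
            combine_all=min,
            assign=min,
        )
        if new_v == v:  # Fixed point reached: labels are stable.
            return v
        v = new_v


if __name__ == "__main__":
    print(connected_components(5, [(0, 1), (1, 2), (3, 4)]))
```

Swapping in other choices for the three operations yields other iterative graph computations on the same skeleton; the min-based choice shown here only propagates the smallest node id through each component.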



Published In

ACM SIGKDD Explorations Newsletter, Volume 14, Issue 2
December 2012
81 pages
ISSN:1931-0145
EISSN:1931-0153
DOI:10.1145/2481244

Publisher

Association for Computing Machinery

New York, NY, United States
