DOI: 10.1145/2949689.2949715

Bermuda: An Efficient MapReduce Triangle Listing Algorithm for Web-Scale Graphs

Published: 18 July 2016

Abstract

Triangle listing plays an important role in graph analysis and has numerous graph mining applications. With the rapid growth of graph data, distributed methods for listing triangles over massive graphs are urgently needed. Therefore, the triangle listing problem has been studied in several distributed infrastructures, including MapReduce. However, existing algorithms suffer from generating and shuffling huge amounts of intermediate data, where, interestingly, a large percentage of this data is redundant. Inspired by this observation, we present the "Bermuda" method, an efficient MapReduce-based triangle listing technique for massive graphs.
Different from existing approaches, Bermuda effectively reduces the size of the intermediate data via redundancy elimination and sharing of messages whenever possible. As a result, Bermuda achieves orders-of-magnitude speedups and enables processing larger graphs that other techniques fail to process under the same resources. Bermuda exploits the locality of processing, i.e., the reduce instance in which each graph vertex will be processed, to avoid redundant message generation from mappers to reducers. Bermuda also proposes novel message sharing techniques within each reduce instance to increase the reuse of the received messages. We present and analyze several reduce-side caching strategies that dynamically learn the expected access patterns of the shared messages, and adaptively deploy the appropriate technique for better sharing. Extensive experiments conducted on real-world large-scale graphs show that Bermuda speeds up the triangle listing computations by factors of up to 10x. Moreover, with a relatively small cluster, Bermuda can scale up to large datasets, e.g., the ClueWeb graph dataset (688GB), while other techniques fail to finish.
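For context, the sequential node-iterator idea that MapReduce triangle-listing algorithms parallelize (by shuffling "wedge" messages between mappers and reducers) can be sketched as follows. This is a generic baseline, not Bermuda's implementation; the function name and example graph are illustrative:

```python
from collections import defaultdict

def list_triangles(edges):
    """Baseline triangle listing (node-iterator style).

    Orients each edge from its lower-ranked to its higher-ranked
    endpoint (rank = degree, ties broken by vertex id), so that every
    triangle is emitted exactly once from its lowest-ranked vertex.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    rank = lambda x: (len(adj[x]), x)
    # Out-neighbors strictly higher in the total order.
    out = {v: {w for w in nbrs if rank(w) > rank(v)} for v, nbrs in adj.items()}
    triangles = []
    for v in adj:
        for u in out[v]:
            for w in out[v]:
                # A wedge (v, u, w) closes into a triangle iff edge (u, w) exists.
                if rank(w) > rank(u) and w in out[u]:
                    triangles.append(tuple(sorted((v, u, w))))
    return triangles

edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 1)]
print(sorted(list_triangles(edges)))  # → [(1, 2, 3), (1, 3, 4)]
```

In the MapReduce setting, the inner wedge checks become messages shuffled to reducers; the redundancy in those messages is precisely what Bermuda targets.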


Cited By

  • (2017) Estimating Clustering Coefficient via Random Walk on MapReduce. 2017 IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS), pp. 493-500. DOI: 10.1109/ICPADS.2017.00071. Online publication date: Dec 2017.

Published In

SSDBM '16: Proceedings of the 28th International Conference on Scientific and Statistical Database Management
July 2016
290 pages

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Distributed Triangle Listing
  2. Graph Analytics
  3. MapReduce

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SSDBM '16

Acceptance Rates

Overall Acceptance Rate 56 of 146 submissions, 38%

