CoRAL: Confined Recovery in Distributed Asynchronous Graph Processing

Published: 04 April 2017

Abstract

Existing distributed asynchronous graph processing systems employ checkpointing to capture globally consistent snapshots and roll back all machines to the most recent checkpoint to recover from machine failures. In this paper, we argue that recovery in distributed asynchronous graph processing does not require the entire execution state to be rolled back to a globally consistent state, owing to the relaxed asynchronous execution semantics. We define the properties required in the recovered state for it to be usable for correct asynchronous processing and develop CoRAL, a lightweight checkpointing and recovery algorithm. First, this algorithm carries out confined recovery that rolls back the graph execution states of only the failed machines to effect recovery. Second, it relies upon lightweight checkpoints that capture locally consistent snapshots with a reduced peak network bandwidth requirement. Our experiments using real-world graphs show that our technique recovers from failures and finishes processing 1.5x to 3.2x faster than the traditional asynchronous checkpointing and recovery mechanism when failures impact 1 to 6 machines of a 16-machine cluster. Moreover, capturing locally consistent snapshots significantly reduces the intermittent high peak bandwidth usage required to save the snapshots: the average reduction in 99th percentile bandwidth ranges from 22% to 51% while 1 to 6 snapshot replicas are being maintained.
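
The idea of confined recovery described in the abstract can be illustrated with a small sketch. This is not the CoRAL algorithm: it is a toy, single-process simulation that assumes a PageRank-style computation, which converges from any starting state, and it does not model the properties the paper defines for a usable recovered state. The graph, the two simulated machines, and every function name below are hypothetical.

```python
# Toy illustration of confined recovery. NOT the CoRAL algorithm itself.
# Every name here (GRAPH, OWNER, sweep, demo, ...) is hypothetical.

DAMPING = 0.85

# Small directed graph: vertex -> out-neighbours (no dangling vertices).
GRAPH = {0: [1, 2], 1: [2], 2: [0, 3], 3: [0], 4: [3, 5], 5: [4, 0]}
# Vertex -> simulated machine that owns it.
OWNER = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

# Vertex -> in-neighbours, derived from GRAPH.
INCOMING = {v: [u for u in GRAPH if v in GRAPH[u]] for v in GRAPH}


def sweep(ranks):
    """Update every vertex in place from the latest neighbour values.

    In-place updates stand in for asynchronous execution: a vertex may
    read a mix of old and new values. Returns the largest change seen.
    """
    delta = 0.0
    for v in sorted(GRAPH):
        incoming = sum(ranks[u] / len(GRAPH[u]) for u in INCOMING[v])
        new_value = (1 - DAMPING) + DAMPING * incoming
        delta = max(delta, abs(new_value - ranks[v]))
        ranks[v] = new_value
    return delta


def run_to_convergence(ranks, tol=1e-10, max_sweeps=10_000):
    """Sweep until the largest per-vertex change drops below tol."""
    for _ in range(max_sweeps):
        if sweep(ranks) < tol:
            break
    return ranks


def local_checkpoint(ranks, machine):
    """Snapshot only the vertices owned by one machine."""
    return {v: r for v, r in ranks.items() if OWNER[v] == machine}


def confined_restore(ranks, snapshot):
    """Roll back only the failed machine's vertices; others keep their state."""
    ranks.update(snapshot)


def demo():
    """Return (failure-free result, result after a confined rollback)."""
    baseline = run_to_convergence({v: 1.0 for v in GRAPH})

    ranks = {v: 1.0 for v in GRAPH}
    for _ in range(3):
        sweep(ranks)
    snapshot = local_checkpoint(ranks, machine=1)  # machine 1 checkpoints
    for _ in range(5):
        sweep(ranks)                               # everyone makes progress
    confined_restore(ranks, snapshot)              # machine 1 fails, rolls back
    recovered = run_to_convergence(ranks)
    return baseline, recovered


if __name__ == "__main__":
    base, rec = demo()
    for v in sorted(GRAPH):
        print(v, round(base[v], 6), round(rec[v], 6))
```

In this sketch the surviving machine keeps its newer values while the failed machine resumes from a stale local snapshot, and the computation still reaches the same fixed point as the failure-free run. That outcome depends on the chosen algorithm tolerating a mixed state; the paper's contribution is to define when a recovered state is usable in general.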

Information

Published In

ACM SIGPLAN Notices, Volume 52, Issue 4
ASPLOS '17
April 2017
811 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/3093336
  • ACM Conferences
    ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems
    April 2017
    856 pages
    ISBN: 9781450344654
    DOI: 10.1145/3037697
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 April 2017
Published in SIGPLAN Volume 52, Issue 4

Author Tags

  1. distributed processing
  2. fault tolerance
  3. graph processing

Qualifiers

  • Research-article

Cited By

  • (2023) ACF2: Accelerating Checkpoint-Free Failure Recovery for Distributed Graph Processing. Web and Big Data, pages 45-59. DOI: 10.1007/978-3-031-25158-0_5. Online publication date: 10-Feb-2023
  • (2022) Exploiting Unblocking Checkpoint for Fault-Tolerance in Pregel-Like Systems. Web Information Systems Engineering – WISE 2021, pages 71-86. DOI: 10.1007/978-3-030-90888-1_6. Online publication date: 1-Jan-2022
  • (2024) Core Graph: Exploiting Edge Centrality to Speedup the Evaluation of Iterative Graph Queries. Proceedings of the Nineteenth European Conference on Computer Systems, pages 18-32. DOI: 10.1145/3627703.3629571. Online publication date: 22-Apr-2024
  • (2023) MEGA Evolving Graph Accelerator. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pages 310-323. DOI: 10.1145/3613424.3614260. Online publication date: 28-Oct-2023
  • (2023) Redundancy-Free High-Performance Dynamic GNN Training with Hierarchical Pipeline Parallelism. Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, pages 17-30. DOI: 10.1145/3588195.3592990. Online publication date: 7-Aug-2023
  • (2023) Expressway: Prioritizing Edges for Distributed Evaluation of Graph Queries. 2023 IEEE International Conference on Big Data (BigData), pages 4362-4371. DOI: 10.1109/BigData59044.2023.10386860. Online publication date: 15-Dec-2023
  • (2022) GGraph: An Efficient Structure-Aware Approach for Iterative Graph Processing. IEEE Transactions on Big Data, 8:5, pages 1182-1194. DOI: 10.1109/TBDATA.2020.3019641. Online publication date: 1-Oct-2022
  • (2021) How Far Have We Come in Fault Tolerance for Distributed Graph Processing: A Quantitative Assessment of Fault Tolerance Effectiveness. 2021 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), pages 427-432. DOI: 10.1109/ISSREW53611.2021.00114. Online publication date: Oct-2021
  • (2021) DepGraph: A Dependency-Driven Accelerator for Efficient Iterative Graph Processing. 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 371-384. DOI: 10.1109/HPCA51647.2021.00039. Online publication date: Feb-2021
  • (2020) AsynGraph. ACM Transactions on Architecture and Code Optimization, 17:4, pages 1-21. DOI: 10.1145/3416495. Online publication date: 30-Sep-2020
