research-article

Zorro: zero-cost reactive failure recovery in distributed graph processing

Authors:

Luke M. Leslie,

Indranil Gupta,

Roy H. Campbell

SoCC '15: Proceedings of the Sixth ACM Symposium on Cloud Computing

Pages 195 - 208

https://doi.org/10.1145/2806777.2806934

Published: 27 August 2015

Abstract

Distributed graph processing systems largely rely on proactive techniques for failure recovery. Unfortunately, these approaches (such as checkpointing) entail a significant overhead. In this paper, we argue that distributed graph processing systems should instead use a reactive approach to failure recovery. The reactive approach gives up some completeness of the result (generating a slightly inaccurate result) in exchange for reducing the overhead during failure-free execution to zero. We build a system called Zorro that embodies this reactive approach, and integrate Zorro into two graph processing systems: PowerGraph and LFGraph. When a failure occurs, Zorro opportunistically exploits vertex replication inherent in today's graph processing systems to quickly rebuild the state of failed servers. Experiments using real-world graphs demonstrate that Zorro is able to recover over 99% of the graph state when 6-12% of the servers fail, and between 87% and 95% when half the cluster fails. Furthermore, using various graph processing algorithms, Zorro incurs little to no accuracy loss in all experimental failure scenarios, and achieves a worst-case accuracy of 97%.
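The replica-based recovery idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation and does not use the PowerGraph or LFGraph APIs. It assumes a simplified model in which every server keeps the latest value of each vertex it owns plus a cached copy of some remote vertices; the names `recover`, `recovery_fraction`, `owner`, and `state` are hypothetical and exist only for illustration.

```python
from typing import Dict, Set

# Hypothetical model: a server's local state maps vertex id -> latest value,
# covering both the vertices it owns and the remote vertices it replicates.
ServerState = Dict[int, float]


def recover(
    owner: Dict[int, int],          # vertex id -> id of the owning server
    state: Dict[int, ServerState],  # server id -> values held on that server
    failed: Set[int],               # ids of the servers that crashed
) -> Dict[int, float]:
    """Rebuild values of vertices owned by failed servers from surviving copies."""
    recovered: Dict[int, float] = {}
    for server_id, held in state.items():
        if server_id in failed:
            continue  # whatever a crashed server held is lost
        for vertex, value in held.items():
            # A surviving copy of a vertex whose owner failed is a usable replica.
            if owner[vertex] in failed and vertex not in recovered:
                recovered[vertex] = value
    return recovered


def recovery_fraction(
    owner: Dict[int, int], failed: Set[int], recovered: Dict[int, float]
) -> float:
    """Fraction of the lost vertices for which some replica was found."""
    lost = [v for v, s in owner.items() if s in failed]
    return len(recovered) / len(lost) if lost else 1.0


if __name__ == "__main__":
    # Three servers; server 1 happens to replicate vertex 0, nobody replicates vertex 1.
    owner = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
    state = {
        0: {0: 0.1, 1: 0.2, 2: 0.3},
        1: {2: 0.3, 3: 0.4, 0: 0.1},
        2: {4: 0.5, 5: 0.6, 3: 0.4},
    }
    rebuilt = recover(owner, state, failed={0})
    print(rebuilt, recovery_fraction(owner, {0}, rebuilt))
```

In this toy model nothing extra is stored or sent while the system runs without failures; recovery only reads copies that already exist. A vertex with no surviving copy (vertex 1 in the example) stays lost, which is the source of the slight inaccuracy the abstract describes.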

Index Terms

Computer systems organization > Architectures > Distributed architectures
Software and its engineering > Software organization and properties > Software system structures > Distributed systems organizing principles

Published In

SoCC '15: Proceedings of the Sixth ACM Symposium on Cloud Computing

August 2015

446 pages

ISBN: 9781450336512

DOI: 10.1145/2806777

General Chair: Shahram Ghandeharizadeh (University of Southern California)
Program Chairs: Magdalena Balazinska (University of Washington), Michael J. Freedman (Princeton University)
Publications Chair: Sumita Barahmand (Microsoft)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SoCC '15

SoCC '15: ACM Symposium on Cloud Computing

August 27 - 29, 2015

Hawaii, Kohala Coast

Acceptance Rates

SoCC '15 Paper Acceptance Rate: 34 of 157 submissions (22%)

Overall Acceptance Rate: 169 of 722 submissions (23%)

Cited By

Xu H, Liu P, Ahmed S, Da Silva D, Hu L (2023). Adaptive Fragment-Based Parallel State Recovery for Stream Processing Systems. IEEE Transactions on Parallel and Distributed Systems 34:8, pp. 2464-2478. Online publication date: Aug-2023. https://doi.org/10.1109/TPDS.2023.3251997
Xu C, Yang Y, Pan Q, Zhou H (2022). ACF2: Accelerating Checkpoint-Free Failure Recovery for Distributed Graph Processing. Web and Big Data, pp. 45-59. Online publication date: 11-Aug-2022. https://dl.acm.org/doi/10.1007/978-3-031-25158-0_5
Yang Y, Yang Z, Xu C (2022). Exploiting Unblocking Checkpoint for Fault-Tolerance in Pregel-Like Systems. Web Information Systems Engineering – WISE 2021, pp. 71-86. Online publication date: 1-Jan-2022. https://doi.org/10.1007/978-3-030-90888-1_6
Zhang C, Li Y, Yang Y, Jia T, Hou Z (2021). How Far Have We Come in Fault Tolerance for Distributed Graph Processing: A Quantitative Assessment of Fault Tolerance Effectiveness. 2021 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), pp. 427-432. Online publication date: Oct-2021. https://doi.org/10.1109/ISSREW53611.2021.00114
Zhang Y, Li J, Zhang Y, Wang L, Liu L (2021). FreeLauncher: Lossless Failure Recovery of Parameter Servers with Ultralight Replication. 2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS), pp. 472-482. Online publication date: Jul-2021. https://doi.org/10.1109/ICDCS51616.2021.00052
Liu P, Xu H, Da Silva D, Wang Q, Ahmed S, Hu L (2020). FP4S: Fragment-based Parallel State Recovery for Stateful Stream Applications. 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1102-1111. Online publication date: May-2020. https://doi.org/10.1109/IPDPS47924.2020.00116
Zhuang Y, Wei X, Li H, Hou M, Wang Y (2020). Reducing Fault-tolerant Overhead for Distributed Stream Processing with Approximate Backup. 2020 29th International Conference on Computer Communications and Networks (ICCCN), pp. 1-9. Online publication date: Aug-2020. https://doi.org/10.1109/ICCCN49398.2020.9209717
Dathathri R, Gill G, Hoang L, Pingali K, Bahar I, Herlihy M, Witchel E, Lebeck A (2019). Phoenix. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 615-630. Online publication date: 4-Apr-2019. https://dl.acm.org/doi/10.1145/3297858.3304056
Wang Z, Gu Y, Bao Y, Yu G, Yu J, Wei Z (2019). HGraph: I/O-efficient Distributed and Iterative Graph Computing by Hybrid Pushing/Pulling. IEEE Transactions on Knowledge and Data Engineering, pp. 1-1. Online publication date: 2019. https://doi.org/10.1109/TKDE.2019.2951407
Cheng Z, Huang Q, Lee P (2019). On the performance and convergence of distributed stream processing via approximate fault tolerance. The VLDB Journal. Online publication date: 3-Sep-2019. https://doi.org/10.1007/s00778-019-00565-w