DOI: 10.1145/2806777.2806934
research-article

Zorro: zero-cost reactive failure recovery in distributed graph processing

Published: 27 August 2015

Abstract

Distributed graph processing systems largely rely on proactive techniques for failure recovery. Unfortunately, these approaches (such as checkpointing) entail significant overhead. In this paper, we argue that distributed graph processing systems should instead use a reactive approach to failure recovery. The reactive approach reduces the overhead during failure-free execution to zero, at the cost of completeness: after a failure, the result may be slightly inaccurate. We build a system called Zorro that embodies this reactive approach, and integrate Zorro into two graph processing systems, PowerGraph and LFGraph. When a failure occurs, Zorro opportunistically exploits the vertex replication inherent in today's graph processing systems to quickly rebuild the state of failed servers. Experiments using real-world graphs demonstrate that Zorro recovers over 99% of the graph state when 6-12% of the servers fail, and 87-95% when half the cluster fails. Furthermore, across various graph processing algorithms, Zorro incurs little to no accuracy loss in all experimental failure scenarios, and achieves a worst-case accuracy of 97%.
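
The idea of rebuilding failed servers from vertex replicas can be illustrated with a minimal sketch. This is not Zorro's actual protocol or API. It assumes a hypothetical cluster layout in which each server holds a dictionary from vertex id to the last known vertex value, and a vertex may be copied on several servers. The names `recover_from_replicas` and `recovered_fraction` are invented for this illustration.

```python
from typing import Dict, Set, Tuple

# Hypothetical layout (an assumption, not Zorro's data model):
# server name -> {vertex id -> last known vertex value}.
ClusterState = Dict[str, Dict[int, float]]


def recover_from_replicas(
    cluster: ClusterState, failed: Set[str]
) -> Tuple[Dict[int, float], Set[int]]:
    """Rebuild vertices held by failed servers from surviving copies.

    Returns (recovered values, vertices with no surviving copy).
    """
    # Every vertex that had a copy on at least one failed server.
    lost = {v for s in failed for v in cluster.get(s, {})}
    recovered: Dict[int, float] = {}
    for server, vertices in cluster.items():
        if server in failed:
            continue  # only surviving servers can supply state
        for v, value in vertices.items():
            if v in lost and v not in recovered:
                recovered[v] = value
    unrecoverable = lost - set(recovered)
    return recovered, unrecoverable


def recovered_fraction(cluster: ClusterState, failed: Set[str]) -> float:
    """Fraction of the lost vertices that could be rebuilt."""
    recovered, unrecoverable = recover_from_replicas(cluster, failed)
    total = len(recovered) + len(unrecoverable)
    return 1.0 if total == 0 else len(recovered) / total


if __name__ == "__main__":
    cluster = {
        "s1": {1: 0.2, 2: 0.3},
        "s2": {2: 0.3, 3: 0.1},
        "s3": {1: 0.2, 4: 0.4},
    }
    # Vertex 3 lives only on s2, so it is lost when s2 fails.
    print(recover_from_replicas(cluster, {"s2"}))
    print(recovered_fraction(cluster, {"s2"}))
```

In this toy setting nothing is saved during failure-free execution. Recovery only reads copies that already exist, which is why a vertex with no surviving copy stays lost and the result can be slightly incomplete.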

Published In

SoCC '15: Proceedings of the Sixth ACM Symposium on Cloud Computing
August 2015
446 pages
ISBN: 9781450336512
DOI: 10.1145/2806777
Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SoCC '15: ACM Symposium on Cloud Computing
August 27-29, 2015
Kohala Coast, Hawaii

Acceptance Rates

SoCC '15 paper acceptance rate: 34 of 157 submissions (22%)
Overall acceptance rate: 169 of 722 submissions (23%)

Cited By

  • (2023) Adaptive Fragment-Based Parallel State Recovery for Stream Processing Systems. IEEE Transactions on Parallel and Distributed Systems 34:8 (2464-2478). DOI: 10.1109/TPDS.2023.3251997. Online publication date: Aug-2023
  • (2022) ACF2: Accelerating Checkpoint-Free Failure Recovery for Distributed Graph Processing. Web and Big Data (45-59). DOI: 10.1007/978-3-031-25158-0_5. Online publication date: 11-Aug-2022
  • (2022) Exploiting Unblocking Checkpoint for Fault-Tolerance in Pregel-Like Systems. Web Information Systems Engineering – WISE 2021 (71-86). DOI: 10.1007/978-3-030-90888-1_6. Online publication date: 1-Jan-2022
  • (2021) How Far Have We Come in Fault Tolerance for Distributed Graph Processing: A Quantitative Assessment of Fault Tolerance Effectiveness. 2021 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW) (427-432). DOI: 10.1109/ISSREW53611.2021.00114. Online publication date: Oct-2021
  • (2021) FreeLauncher: Lossless Failure Recovery of Parameter Servers with Ultralight Replication. 2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS) (472-482). DOI: 10.1109/ICDCS51616.2021.00052. Online publication date: Jul-2021
  • (2020) FP4S: Fragment-based Parallel State Recovery for Stateful Stream Applications. 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (1102-1111). DOI: 10.1109/IPDPS47924.2020.00116. Online publication date: May-2020
  • (2020) Reducing Fault-tolerant Overhead for Distributed Stream Processing with Approximate Backup. 2020 29th International Conference on Computer Communications and Networks (ICCCN) (1-9). DOI: 10.1109/ICCCN49398.2020.9209717. Online publication date: Aug-2020
  • (2019) Phoenix. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (615-630). DOI: 10.1145/3297858.3304056. Online publication date: 4-Apr-2019
  • (2019) HGraph: I/O-efficient Distributed and Iterative Graph Computing by Hybrid Pushing/Pulling. IEEE Transactions on Knowledge and Data Engineering (1-1). DOI: 10.1109/TKDE.2019.2951407. Online publication date: 2019
  • (2019) On the performance and convergence of distributed stream processing via approximate fault tolerance. The VLDB Journal. DOI: 10.1007/s00778-019-00565-w. Online publication date: 3-Sep-2019
