Abstract
In recent years, many large-scale iterative graph computation systems, such as Pregel, have been developed. To make these systems fault-tolerant, checkpointing, which periodically archives graph states onto a distributed file system, has been proposed. However, fault tolerance remains challenging because the whole dataset is archived at a static interval, which causes the underlying graph computations to incur I/O costs in terms of disk and network communication. Motivated by this, we first propose to dynamically adjust checkpoint intervals based on a carefully designed cost-analysis model that takes the underlying computing workload into account. Furthermore, for algorithms that can be restarted from any point during computation, we prioritize graph states so that checkpointing can be performed on selected data instead of the entire dataset, reducing archiving overhead while still guaranteeing efficient failure recovery. Finally, we conduct extensive performance studies on a broad spectrum of real-world graphs to confirm the effectiveness of our approaches over existing state-of-the-art solutions.
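The abstract does not spell out the cost-analysis model or the prioritization rule, so the following is only a minimal sketch of the two ideas under stated assumptions. For the interval it substitutes Young's classical first-order approximation, sqrt(2 × checkpoint cost × mean time between failures), which is not the authors' model. For selective checkpointing it simply keeps the highest-priority fraction of vertex states. All function names, parameters, and the priority values are hypothetical.

```python
import heapq
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order checkpoint interval: sqrt(2 * C * MTBF).

    Stand-in for the paper's cost model (assumption, not the authors' formula).
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def should_checkpoint(elapsed_s: float, checkpoint_cost_s: float, mtbf_s: float) -> bool:
    """Re-evaluate the interval from the currently measured checkpoint cost.

    Because the cost is passed in on every call, the interval adapts when
    the workload (and hence the cost of archiving) changes.
    """
    return elapsed_s >= young_interval(checkpoint_cost_s, mtbf_s)


def select_states(priorities: dict, fraction: float) -> list:
    """Return the ids of the top `fraction` of vertices by priority.

    Hypothetical selection rule: only these states would be archived.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = math.ceil(fraction * len(priorities))
    return heapq.nlargest(k, priorities, key=priorities.get)


if __name__ == "__main__":
    # Assumed measurements: a checkpoint takes 50 s, failures occur every 10,000 s.
    print(young_interval(50.0, 10_000.0))
    print(select_states({"a": 0.5, "b": 0.1, "c": 0.9, "d": 0.3}, 0.5))
```

A static scheme would fix the interval once; here the interval is recomputed from the measured checkpoint cost each time, which is the general shape of an adaptive policy, though the quantities a real system would measure may differ.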
Acknowledgements
This work was supported by the National Natural Science Foundation of China (61472071, 61433008, 61528203, 61272179, and 61602103) and the U.S. NSF Grant CNS-1217284. Zhigang Wang was a visiting student at UMass Amherst, supported by the China Scholarship Council, when this work was performed. The authors are also grateful to the anonymous reviewers for their constructive comments.
Cite this article
Wang, Z., Gu, Y., Bao, Y. et al. An I/O-efficient and adaptive fault-tolerant framework for distributed graph computations. Distrib Parallel Databases 35, 177–196 (2017). https://doi.org/10.1007/s10619-017-7192-2
DOI: https://doi.org/10.1007/s10619-017-7192-2