Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/2505515.2505753acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
research-article

"All roads lead to Rome": optimistic recovery for distributed iterative data processing

Published: 27 October 2013 Publication History
  • Get Citation Alerts
  • Abstract

    Executing data-parallel iterative algorithms on large datasets is crucial for many advanced analytical applications in the fields of data mining and machine learning. Current systems for executing iterative tasks in large clusters typically achieve fault tolerance through rollback recovery. The principle behind this pessimistic approach is to periodically checkpoint the algorithm state. Upon failure, the system restores a consistent state from a previously written checkpoint and resumes execution from that point.
    We propose an optimistic recovery mechanism using algorithmic compensations. Our method leverages the robust, self-correcting nature of a large class of fixpoint algorithms used in data mining and machine learning, which converge to the correct solution from various intermediate consistent states. In the case of a failure, we apply a user-defined compensate function that algorithmically creates such a consistent state, instead of rolling back to a previous checkpointed state. Our optimistic recovery does not checkpoint any state and hence achieves optimal failure-free performance with respect to the overhead necessary for guaranteeing fault tolerance.
    We illustrate the applicability of this approach for three wide classes of problems. Furthermore, we show how to implement the proposed optimistic recovery mechanism in a data flow system. Similar to the Combine operator in MapReduce, our proposed functionality is optional and can be applied to increase performance without changing the semantics of programs.
    In an experimental evaluation on large datasets, we show that our proposed approach provides optimal failure-free performance. In the absence of failures our optimistic scheme is able to outperform a pessimistic approach by a factor of two to five. In presence of failures, our approach provides fast recovery and outperforms pessimistic approaches in the majority of cases.

    References

    [1]
    Apache Giraph, http://giraph.apache.org.
    [2]
    Apache Hadoop, http://hadoop.apache.org.
    [3]
    D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke. Nephele/PACTs: A programming model and execution framework for web-scale analytical processing. In SoCC, pp. 119--130, 2010.
    [4]
    D. Bertsekas and J. Tsitsiklis. Parallel and Distributed Computation. Athena Scientific, 1997.
    [5]
    P. Boldi and S. Vigna. The Webgraph Framework I: Compression techniques. In WWW, pp. 595--602, 2004.
    [6]
    P. Bonacich. Power and Centrality: A family of measures. American Journal of Sociology, pp. 1170--1182, 1987.
    [7]
    V. R. Borkar, M. J. Carey, R. Grover, N. Onose, and R. Vernica. Hyracks: A flexible and extensible foundation for data-intensive computing. In ICDE, pp. 1151--1162, 2011.
    [8]
    Y. Bu, V. R. Borkar, M. J. Carey, J. Rosen, N. Polyzotis, T. Condie, M. Weimer, and R. Ramakrishnan. Scaling datalog for machine learning on big data. CoRR, abs/1203.0160, 2012.
    [9]
    M. Burrows. The Chubby lock service for loosely-coupled distributed systems. In OSDI, pp. 335--350, 2006.
    [10]
    M. Cha, H. Haddadi, F. Benevenuto, and P. K. Gummadi. Measuring user influence in Twitter: The million follower fallacy. In ICWSM, 2010.
    [11]
    A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: Scalable online collaborative filtering. In WWW, pp. 271--280, 2007.
    [12]
    J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107--113, 2008.
    [13]
    E. N. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv., 34(3):375--408, 2002.
    [14]
    S. Ewen, K. Tzoumas, M. Kaufmann, and V. Markl. Spinning fast iterative data flows. PVLDB, 5(11):1268--1279, 2012.
    [15]
    H. Garcia-Molina and K. Salem. Sagas. In SIGMOD, pp. 249--259, 1987.
    [16]
    R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In KDD, pp. 69--77, 2011.
    [17]
    S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In SOSP, pp. 29--43, 2003.
    [18]
    G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1996.
    [19]
    M. Gondran and M. Minoux. Graphs, Dioids and Semirings - New Models and Algorithms. Springer, 2008.
    [20]
    Y. Hu, Y. Koren, and C. Volinsky. Collaborative filtering for implicit feedback datasets. In ICDM, pp. 263--272, 2008.
    [21]
    U. Kang, S. Papadimitriou, J. Sun, and H. Tong. Centralities in large networks: Algorithms and observations. In SDM, pp. 119--130, 2011.
    [22]
    U. Kang, C. E. Tsourakakis, and C. Faloutsos. Pegasus: A peta-scale graph mining system. In ICDM, pp. 229--238, 2009.
    [23]
    L. Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39--43, 1953.
    [24]
    Y. Koren, R. M. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 2009.
    [25]
    Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning in the cloud. PVLDB, 5(8):716--727, 2012.
    [26]
    G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In SIGMOD pp. 135--146, 2010.
    [27]
    F. McSherry, D. G. Murray, R. Isaacs, and M. Isard. Differential dataflow. In CIDR, 2013.
    [28]
    S. R. Mihaylov, Z. G. Ives, and S. Guha. Rex: Recursive, delta-based data-centric computation. PVLDB, 5(11):1280--1291, 2012.
    [29]
    M. E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E, 74:036104, 2006.
    [30]
    L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Stanford InfoLab, 1999.
    [31]
    A. Reuter and F. Schwenkreis. ConTracts - A low-level mechanism for building general-purpose workflow management-systems. IEEE Data Eng. Bull., 18(1):4--10, 1995.
    [32]
    M. Zaharia, M. Chowdhury, T. Das, D. Ankur, M. McCauley, M. Franklin, S. Shenker and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. NDSI, pp. 2--2 2012.
    [33]
    L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103--111, 1990.
    [34]
    M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica.Discretized streams: An efficient and fault-tolerant model for stream processing on large clusters. In HotCLoud, 2012.
    [35]
    Y. Zhou, D. M. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the netflix prize. In AAIM, pp. 337--348, 2008.

    Cited By

    View all
    • (2024)A Fuzzy PID-Incorporated Stochastic Gradient Descent Algorithm for Fast and Accurate Latent Factor AnalysisIEEE Transactions on Fuzzy Systems10.1109/TFUZZ.2024.338973332:7(4049-4061)Online publication date: Jul-2024
    • (2024)Adaptive Divergence-Based Non-Negative Latent Factor Analysis of High-Dimensional and Incomplete Matrices From Industrial ApplicationsIEEE Transactions on Emerging Topics in Computational Intelligence10.1109/TETCI.2023.33325508:2(1209-1222)Online publication date: Apr-2024
    • (2022)An α–β-Divergence-Generalized Recommender for Highly Accurate Predictions of Missing User PreferencesIEEE Transactions on Cybernetics10.1109/TCYB.2020.302642552:8(8006-8018)Online publication date: Aug-2022
    • Show More Cited By

    Index Terms

    1. "All roads lead to Rome": optimistic recovery for distributed iterative data processing

      Recommendations

      Comments

      Information & Contributors

      Information

      Published In

      cover image ACM Conferences
      CIKM '13: Proceedings of the 22nd ACM international conference on Information & Knowledge Management
      October 2013
      2612 pages
      ISBN:9781450322638
      DOI:10.1145/2505515
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Sponsors

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 27 October 2013

      Permissions

      Request permissions for this article.

      Check for updates

      Author Tags

      1. fault-tolerance
      2. iterative algorithms
      3. optimistic recovery

      Qualifiers

      • Research-article

      Conference

      CIKM'13
      Sponsor:
      CIKM'13: 22nd ACM International Conference on Information and Knowledge Management
      October 27 - November 1, 2013
      California, San Francisco, USA

      Acceptance Rates

      CIKM '13 Paper Acceptance Rate 143 of 848 submissions, 17%;
      Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

      Upcoming Conference

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)16
      • Downloads (Last 6 weeks)1

      Other Metrics

      Citations

      Cited By

      View all
      • (2024)A Fuzzy PID-Incorporated Stochastic Gradient Descent Algorithm for Fast and Accurate Latent Factor AnalysisIEEE Transactions on Fuzzy Systems10.1109/TFUZZ.2024.338973332:7(4049-4061)Online publication date: Jul-2024
      • (2024)Adaptive Divergence-Based Non-Negative Latent Factor Analysis of High-Dimensional and Incomplete Matrices From Industrial ApplicationsIEEE Transactions on Emerging Topics in Computational Intelligence10.1109/TETCI.2023.33325508:2(1209-1222)Online publication date: Apr-2024
      • (2022)An α–β-Divergence-Generalized Recommender for Highly Accurate Predictions of Missing User PreferencesIEEE Transactions on Cybernetics10.1109/TCYB.2020.302642552:8(8006-8018)Online publication date: Aug-2022
      • (2022)ACF2: Accelerating Checkpoint-Free Failure Recovery for Distributed Graph ProcessingWeb and Big Data10.1007/978-3-031-25158-0_5(45-59)Online publication date: 11-Aug-2022
      • (2021)Handling Iterations in Distributed Dataflow SystemsACM Computing Surveys10.1145/347760254:9(1-38)Online publication date: 8-Oct-2021
      • (2021)A Fault-Tolerant Distributed Framework for Asynchronous Iterative ComputationsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.305942032:8(2062-2073)Online publication date: 1-Aug-2021
      • (2021)Continuous Distributed Processing of Software Defined Radar2021 CIE International Conference on Radar (Radar)10.1109/Radar53847.2021.10028245(2876-2880)Online publication date: 15-Dec-2021
      • (2021)SmartDL: energy-aware decremental learning in a mobile-based federation for geo-spatial systemNeural Computing and Applications10.1007/s00521-021-06378-935:5(3677-3696)Online publication date: 9-Aug-2021
      • (2021)Hybrid Checkpointing for Iterative Processing in BSP-Based SystemsWeb Information Systems and Applications10.1007/978-3-030-87571-8_60(693-705)Online publication date: 17-Sep-2021
      • (2020)Marabunta: Continuous Distributed Processing of Skewed Streams2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID)10.1109/CCGrid49817.2020.00-68(252-261)Online publication date: May-2020
      • Show More Cited By

      View Options

      Get Access

      Login options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media