DOI: 10.1145/2723372.2749437
research-article

Cost-based Fault-tolerance for Parallel Data Processing

Published: 27 May 2015

Abstract

To deal with mid-query failures, parallel data engines (PDEs) implement different fault-tolerance schemes today: (1) parallel databases typically implement fault tolerance in a coarse-grained manner, restarting a query completely when a mid-query failure occurs, and (2) modern MapReduce-style PDEs implement a fine-grained fault-tolerance scheme that either materializes intermediate results or uses a lineage model to recover from mid-query failures. However, neither scheme efficiently handles mixed workloads that combine short-running interactive queries with long-running batch queries, nor does either efficiently support a wide range of cluster setups that vary in cluster size and in other parameters such as the mean time between failures. In this paper, we present a novel cost-based fault-tolerance scheme that tackles this issue. In contrast to the existing schemes, our scheme selects a subset of intermediates to materialize such that the total query runtime is minimized under mid-query failures. Our experiments show that our cost-based fault-tolerance scheme outperforms all existing strategies and always selects the sweet spot for short- and long-running queries as well as for different cluster setups.
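
The trade-off described in the abstract can be illustrated with a small sketch. This is not the paper's cost model; it is a toy under stated assumptions: the query is a linear pipeline of operators, failures arrive as a Poisson process with rate 1/MTBF, a failure restarts execution from the most recent materialized intermediate, and recovery itself takes no extra time. The function names (`expected_segment_time`, `expected_runtime`, `best_materialization`) and the per-operator cost tuples are hypothetical.

```python
import math
from itertools import combinations


def expected_segment_time(work, mtbf):
    """Expected wall-clock time to finish `work` seconds of failure-free
    execution when failures are Poisson with mean time between failures
    `mtbf` and each failure restarts the segment (assumed toy model)."""
    rate = 1.0 / mtbf
    # (e^(rate * work) - 1) / rate; expm1 keeps precision for tiny rates.
    return math.expm1(rate * work) / rate


def expected_runtime(ops, materialized, mtbf):
    """ops: list of (run_cost, materialize_cost) for a linear pipeline.
    materialized: set of operator indexes whose output is written out.
    A materialized output ends a restartable segment."""
    total, segment = 0.0, 0.0
    for i, (run_cost, mat_cost) in enumerate(ops):
        segment += run_cost
        is_last = i == len(ops) - 1
        if i in materialized:
            segment += mat_cost  # writing the intermediate costs time
        if i in materialized or is_last:
            total += expected_segment_time(segment, mtbf)
            segment = 0.0
    return total


def best_materialization(ops, mtbf):
    """Exhaustively try every subset of intermediates (all operators
    except the last) and return (expected_runtime, chosen_subset)."""
    candidates = range(len(ops) - 1)
    best = None
    for k in range(len(ops)):
        for subset in combinations(candidates, k):
            cost = expected_runtime(ops, set(subset), mtbf)
            if best is None or cost < best[0]:
                best = (cost, set(subset))
    return best


if __name__ == "__main__":
    pipeline = [(10.0, 1.0), (10.0, 1.0), (10.0, 1.0)]  # made-up costs
    for mtbf in (1e9, 10.0):
        cost, chosen = best_materialization(pipeline, mtbf)
        print(f"MTBF={mtbf:g}: materialize {sorted(chosen)}, expected {cost:.2f}s")
```

With these made-up costs, a very large MTBF favors materializing nothing (as a restart-only parallel database would), while a short MTBF favors materializing every intermediate (as a MapReduce-style engine would); intermediate MTBF values or uneven operator costs can favor a proper subset. Exhaustive enumeration is exponential in the number of operators and is used here only to keep the sketch short.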

Published In

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
May 2015
2110 pages
ISBN:9781450327589
DOI:10.1145/2723372
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGMOD/PODS'15: International Conference on Management of Data
May 31 - June 4, 2015
Melbourne, Victoria, Australia

Acceptance Rates

SIGMOD '15 Paper Acceptance Rate 106 of 415 submissions, 26%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%

Cited By

  • (2023) IoT Service Runtime Fault Tolerance Mechanism Based on Flink Dynamic Checkpoint. Service Science, pp. 91-105. DOI: 10.1007/978-981-99-4402-6_7. Online publication date: 27-Jul-2023.
  • (2022) Dynamic Fault Tolerance for Multi-Node Query Processing. IEICE Transactions on Information and Systems E105.D(5):909-919. DOI: 10.1587/transinf.2021DAP0004. Online publication date: 1-May-2022.
  • (2021) Phoebe. Proceedings of the VLDB Endowment 14(11):2505-2518. DOI: 10.14778/3476249.3476298. Online publication date: 27-Oct-2021.
  • (2019) Smart Intra-query Fault Tolerance for Massive Parallel Processing Databases. Data Science and Engineering 5(1):65-79. DOI: 10.1007/s41019-019-00114-z. Online publication date: 19-Dec-2019.
  • (2019) Integrating workload balancing and fault tolerance in distributed stream processing system. World Wide Web 22(6):2471-2496. DOI: 10.1007/s11280-018-0656-0. Online publication date: 7-Jan-2019.
  • (2019) An optimal checkpointing model with online OCI adjustment for stream processing applications. Concurrency and Computation: Practice and Experience 31(20). DOI: 10.1002/cpe.5347. Online publication date: 10-Jun-2019.
  • (2018) An Optimal Checkpointing Model with Online OCI Adjustment for Stream Processing Applications. 2018 27th International Conference on Computer Communication and Networks (ICCCN), pp. 1-9. DOI: 10.1109/ICCCN.2018.8487327. Online publication date: Jul-2018.
  • (2018) A Task Allocation Method for Stream Processing with Recovery Latency Constraint. Journal of Computer Science and Technology 33(6):1125-1139. DOI: 10.1007/s11390-018-1876-6. Online publication date: 19-Nov-2018.
  • (2017) Minimum Backups for Stream Processing With Recovery Latency Guarantees. IEEE Transactions on Reliability 66(3):783-794. DOI: 10.1109/TR.2017.2712563. Online publication date: Sep-2017.
  • (2017) Integrated recovery and task allocation for stream processing. 2017 IEEE 36th International Performance Computing and Communications Conference (IPCCC), pp. 1-8. DOI: 10.1109/PCCC.2017.8280443. Online publication date: Dec-2017.
