research-article

Cost-based Fault-tolerance for Parallel Data Processing

Authors:

Abdallah Salama,

Carsten Binnig,

Erfan Zamanian

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

Pages 285 - 297

https://doi.org/10.1145/2723372.2749437

Published: 27 May 2015

Abstract

To deal with mid-query failures, parallel data engines (PDEs) implement different fault-tolerance schemes today: (1) parallel databases typically implement fault tolerance in a coarse-grained manner by restarting a query completely when a mid-query failure occurs, and (2) modern MapReduce-style PDEs implement a fine-grained fault-tolerance scheme, which either materializes intermediate results or implements a lineage model to recover from mid-query failures. However, neither of these schemes efficiently handles mixed workloads that combine short-running interactive queries with long-running batch queries, nor does either efficiently support a wide range of cluster setups that vary in cluster size and in other parameters such as the mean time between failures. In this paper, we present a novel cost-based fault-tolerance scheme that tackles this issue. In contrast to the existing schemes, our scheme selects a subset of intermediates to be materialized such that the total query runtime is minimized under mid-query failures. Our experiments show that our cost-based fault-tolerance scheme outperforms all existing strategies and always selects the sweet spot for short- and long-running queries as well as for different cluster setups.
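The trade-off described in the abstract can be illustrated with a toy model. The sketch below is not the paper's cost model; it is a minimal illustration under assumptions that do not come from the source: the query is a linear pipeline of operators, failures arrive with exponentially distributed gaps whose mean is the MTBF, a failure restarts work from the last materialized intermediate, and recovery itself costs nothing. The names `expected_segment_time`, `expected_runtime`, and `best_plan` are hypothetical, and the brute-force enumeration is only practical for very small plans.

```python
import math
from itertools import product


def expected_segment_time(work, mtbf):
    """Expected time to finish `work` seconds without interruption when a
    failure (rate 1/mtbf) forces the segment to restart from scratch."""
    lam = 1.0 / mtbf
    return math.expm1(lam * work) / lam


def expected_runtime(ops, materialize, mtbf):
    """Expected runtime of a linear pipeline (toy model, an assumption).

    ops         -- list of (runtime, materialization_cost) per operator
    materialize -- one bool per operator: persist its output or not
    """
    total, segment = 0.0, 0.0
    for (runtime, mat_cost), keep in zip(ops, materialize):
        segment += runtime
        if keep:
            # Writing the intermediate closes the current restart segment.
            segment += mat_cost
            total += expected_segment_time(segment, mtbf)
            segment = 0.0
    if segment > 0.0:
        total += expected_segment_time(segment, mtbf)
    return total


def best_plan(ops, mtbf):
    """Enumerate every subset of intermediates (the last operator's output is
    the query result and is never materialized) and keep the cheapest one.
    Assumes `ops` is non-empty."""
    n = len(ops)
    candidates = (c + (False,) for c in product((False, True), repeat=n - 1))
    return min(candidates, key=lambda c: expected_runtime(ops, c, mtbf))


if __name__ == "__main__":
    short_query = [(1.0, 0.5), (1.0, 0.5), (1.0, 0.5)]
    long_query = [(100.0, 5.0), (100.0, 5.0), (100.0, 5.0)]
    print(best_plan(short_query, mtbf=1e6))
    print(best_plan(long_query, mtbf=100.0))
```

In this toy setting, the short query on a reliable cluster is cheapest with no materialization at all, while the long query on a failure-prone cluster is cheapest when both intermediates are materialized. This mirrors the abstract's point that neither a purely coarse-grained nor a purely fine-grained scheme fits every workload and cluster setup.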

Index Terms

Cost-based Fault-tolerance for Parallel Data Processing
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
      2. Parallel and distributed DBMSs
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Published In

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

May 2015

2110 pages

ISBN:9781450327589

DOI:10.1145/2723372

General Chair: Timos Sellis (RMIT University, Australia)

Program Chairs: Susan B. Davidson (University of Pennsylvania, USA), Zack Ives (University of Pennsylvania, USA)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS'15

Sponsor:

SIGMOD

SIGMOD/PODS'15: International Conference on Management of Data

May 31 - June 4, 2015

Victoria, Melbourne, Australia

Acceptance Rates

SIGMOD '15 Paper Acceptance Rate 106 of 415 submissions, 26%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Article Metrics

Total Citations: 14
Total Downloads: 720
Downloads (Last 12 months): 8
Downloads (Last 6 weeks): 2

Reflects downloads up to 25 Dec 2024

Cited By

Bai W, Fang J, Chang W (2023). IoT Service Runtime Fault Tolerance Mechanism Based on Flink Dynamic Checkpoint. Service Science, pages 91-105. Online publication date: 27-Jul-2023
https://doi.org/10.1007/978-981-99-4402-6_7
Bessho Y, Hayamizu Y, Goda K, Kitsuregawa M (2022). Dynamic Fault Tolerance for Multi-Node Query Processing. IEICE Transactions on Information and Systems, E105.D:5, pages 909-919. Online publication date: 1-May-2022
https://doi.org/10.1587/transinf.2021DAP0004
Zhu Y, Interlandi M, Roy A, Das K, Patel H, Bag M, Sharma H, Jindal A (2021). Phoebe. Proceedings of the VLDB Endowment, 14:11, pages 2505-2518. Online publication date: 27-Oct-2021
https://dl.acm.org/doi/10.14778/3476249.3476298
Ji Y, Chai Y, Zhou X, Ren L, Qin Y (2019). Smart Intra-query Fault Tolerance for Massive Parallel Processing Databases. Data Science and Engineering, 5:1, pages 65-79. Online publication date: 19-Dec-2019
https://doi.org/10.1007/s41019-019-00114-z
Fang J, Chao P, Zhang R, Zhou X (2019). Integrating workload balancing and fault tolerance in distributed stream processing system. World Wide Web, 22:6, pages 2471-2496. Online publication date: 7-Jan-2019
https://doi.org/10.1007/s11280-018-0656-0
Zhuang Y, Wei X, Li H, Wang Y, He X (2019). An optimal checkpointing model with online OCI adjustment for stream processing applications. Concurrency and Computation: Practice and Experience, 31:20. Online publication date: 10-Jun-2019
https://doi.org/10.1002/cpe.5347
Zhuang Y, Wei X, Li H, Wang Y, He X (2018). An Optimal Checkpointing Model with Online OCI Adjustment for Stream Processing Applications. 2018 27th International Conference on Computer Communication and Networks (ICCCN), pages 1-9. Online publication date: Jul-2018
https://doi.org/10.1109/ICCCN.2018.8487327
Li H, Wu J, Jiang Z, Li X, Wei X (2018). A Task Allocation Method for Stream Processing with Recovery Latency Constraint. Journal of Computer Science and Technology, 33:6, pages 1125-1139. Online publication date: 19-Nov-2018
https://doi.org/10.1007/s11390-018-1876-6
Li H, Wu J, Jiang Z, Li X, Wei X (2017). Minimum Backups for Stream Processing With Recovery Latency Guarantees. IEEE Transactions on Reliability, 66:3, pages 783-794. Online publication date: Sep-2017
https://doi.org/10.1109/TR.2017.2712563
Li H, Wu J, Jiang Z, Li X, Wei X, Zhuang Y (2017). Integrated recovery and task allocation for stream processing. 2017 IEEE 36th International Performance Computing and Communications Conference (IPCCC), pages 1-8. Online publication date: Dec-2017
https://doi.org/10.1109/PCCC.2017.8280443