research-article

HAWQ: a massively parallel processing SQL engine in hadoop

Authors:

Milind Bhandarkar

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

Pages 1223 - 1234

https://doi.org/10.1145/2588555.2595636

Published: 18 June 2014

Abstract

HAWQ, developed at Pivotal, is a massively parallel processing (MPP) SQL engine that runs on top of HDFS. As a hybrid of an MPP database and Hadoop, it inherits the merits of both. It adopts a layered architecture and relies on the distributed file system for data replication and fault tolerance. In addition, it is standard-SQL compliant and, unlike other SQL engines on Hadoop, fully transactional. This paper presents the novel design of HAWQ, including query processing, the scalable software interconnect based on the UDP protocol, transaction management, fault tolerance, read-optimized storage, the extensible framework for supporting various popular Hadoop-based data stores and formats, and the optimization choices we considered to enhance query performance. An extensive performance study shows that HAWQ is about 40x faster than Stinger, which is reported to be 35x-45x faster than the original Hive.

Index Terms

HAWQ: a massively parallel processing SQL engine in hadoop
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Parallel and distributed DBMSs

Information & Contributors

Information

Published In

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

June 2014

1645 pages

ISBN:9781450323765

DOI:10.1145/2588555

General Chairs: Curtis Dyreson (Utah State University, USA), Feifei Li (University of Utah, USA)

Program Chair: M. Tamer Özsu (University of Waterloo, Canada)

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 June 2014

Qualifiers

Research-article

Conference

SIGMOD/PODS'14

Sponsor:

SIGMOD

SIGMOD/PODS'14: International Conference on Management of Data

June 22 - 27, 2014

Snowbird, Utah, USA

Acceptance Rates

SIGMOD '14 Paper Acceptance Rate 107 of 421 submissions, 25%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 38

Total Downloads: 1,343

Downloads (Last 12 months): 27

Downloads (Last 6 weeks): 1

Reflects downloads up to 14 Oct 2024

Citations

Cited By

Lin W (2024). A Design of Hybrid Transactional and Analytical Processing Database for Energy Efficient Big Data Queries. Green, Pervasive, and Cloud Computing, pp. 128-138. Online publication date: 23-Jan-2024.
https://doi.org/10.1007/978-981-99-9893-7_10
Bian H, Sha T, Ailamaki A (2023). Using Cloud Functions as Accelerator for Elastic Data Analytics. Proceedings of the ACM on Management of Data, 1:2, pp. 1-27. Online publication date: 20-Jun-2023.
https://dl.acm.org/doi/10.1145/3589306
Li Z, Peng B, Huang Q, Weng C (2022). Karst: Transactional Data Ingestion Without Blocking on a Scalable Architecture. IEEE Transactions on Knowledge and Data Engineering, 34:5, pp. 2241-2253. Online publication date: 1-May-2022.
https://doi.org/10.1109/TKDE.2020.3011510
Xing J, Liu B, Li J, Choudhury F, Wang X (2022). Optimal Subgraph Matching Queries over Distributed Knowledge Graphs Based on Partial Evaluation. Web Information Systems Engineering – WISE 2021, pp. 274-289. Online publication date: 1-Jan-2022.
https://doi.org/10.1007/978-3-030-90888-1_22
Choudhury A, Rangra K (2021). Trends and Technologies in Big Data Processing. Research Anthology on Privatizing and Securing Data, pp. 42-67. Online publication date: 2021.
https://doi.org/10.4018/978-1-7998-8954-0.ch003
Choudhury A, Rangra K (2020). Trends and Technologies in Big Data Processing. Innovations, Algorithms, and Applications in Cognitive Informatics and Natural Intelligence, pp. 17-42. Online publication date: 2020.
https://doi.org/10.4018/978-1-7998-3038-2.ch002
Jin Z, Shi H, Hu Y, Zha L, Lu X (2020). CirroData: Yet Another SQL-on-Hadoop Data Analytics Engine with High Performance. Journal of Computer Science and Technology, 35:1, pp. 194-208. Online publication date: 17-Jan-2020.
https://doi.org/10.1007/s11390-020-9536-z
Abouzied A, Abadi D, Bajda-Pawlikowski K, Silberschatz A (2019). Integration of large-scale data processing systems and traditional parallel database technology. Proceedings of the VLDB Endowment, 12:12, pp. 2290-2299. Online publication date: 1-Aug-2019.
https://dl.acm.org/doi/10.14778/3352063.3352145
Ponce L, Santos W, Meira W, Guedes D, Lezzi D, Badia R (2019). Upgrading a high performance computing environment for massive data processing. Journal of Internet Services and Applications, 10:1. Online publication date: 16-Oct-2019.
https://doi.org/10.1186/s13174-019-0118-7
Arnold J, Glavic B, Raicu I (2019). A High-Performance Distributed Relational Database System for Scalable OLAP Processing. 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 738-748. Online publication date: May-2019.
https://doi.org/10.1109/IPDPS.2019.00083
