HAWQ: a massively parallel processing SQL engine in Hadoop

Published: 18 June 2014

Abstract

HAWQ, developed at Pivotal, is a massively parallel processing (MPP) SQL engine that runs on top of HDFS. As a hybrid of an MPP database and Hadoop, it inherits the merits of both. It adopts a layered architecture and relies on the distributed file system for data replication and fault tolerance. In addition, it is compliant with standard SQL and, unlike other SQL engines on Hadoop, fully transactional. This paper presents the novel design of HAWQ, including query processing, the scalable software interconnect based on the UDP protocol, transaction management, fault tolerance, read-optimized storage, the extensible framework for supporting various popular Hadoop-based data stores and formats, and the optimization choices we considered to enhance query performance. The extensive performance study shows that HAWQ is about 40x faster than Stinger, which is itself reported to be 35x-45x faster than the original Hive.

    Published In

    SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
    June 2014
    1645 pages
    ISBN:9781450323765
    DOI:10.1145/2588555

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Hadoop
    2. database
    3. query processing

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS'14

    Acceptance Rates

    SIGMOD '14 Paper Acceptance Rate 107 of 421 submissions, 25%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Cited By

    • (2024) A Design of Hybrid Transactional and Analytical Processing Database for Energy Efficient Big Data Queries. Green, Pervasive, and Cloud Computing, pp. 128-138. DOI: 10.1007/978-981-99-9893-7_10. Online publication date: 23-Jan-2024
    • (2023) Using Cloud Functions as Accelerator for Elastic Data Analytics. Proceedings of the ACM on Management of Data, 1:2, pp. 1-27. DOI: 10.1145/3589306. Online publication date: 20-Jun-2023
    • (2022) Karst: Transactional Data Ingestion Without Blocking on a Scalable Architecture. IEEE Transactions on Knowledge and Data Engineering, 34:5, pp. 2241-2253. DOI: 10.1109/TKDE.2020.3011510. Online publication date: 1-May-2022
    • (2022) Optimal Subgraph Matching Queries over Distributed Knowledge Graphs Based on Partial Evaluation. Web Information Systems Engineering – WISE 2021, pp. 274-289. DOI: 10.1007/978-3-030-90888-1_22. Online publication date: 1-Jan-2022
    • (2021) Trends and Technologies in Big Data Processing. Research Anthology on Privatizing and Securing Data, pp. 42-67. DOI: 10.4018/978-1-7998-8954-0.ch003. Online publication date: 2021
    • (2020) Trends and Technologies in Big Data Processing. Innovations, Algorithms, and Applications in Cognitive Informatics and Natural Intelligence, pp. 17-42. DOI: 10.4018/978-1-7998-3038-2.ch002. Online publication date: 2020
    • (2020) CirroData: Yet Another SQL-on-Hadoop Data Analytics Engine with High Performance. Journal of Computer Science and Technology, 35:1, pp. 194-208. DOI: 10.1007/s11390-020-9536-z. Online publication date: 17-Jan-2020
    • (2019) Integration of large-scale data processing systems and traditional parallel database technology. Proceedings of the VLDB Endowment, 12:12, pp. 2290-2299. DOI: 10.14778/3352063.3352145. Online publication date: 1-Aug-2019
    • (2019) Upgrading a high performance computing environment for massive data processing. Journal of Internet Services and Applications, 10:1. DOI: 10.1186/s13174-019-0118-7. Online publication date: 16-Oct-2019
    • (2019) A High-Performance Distributed Relational Database System for Scalable OLAP Processing. 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 738-748. DOI: 10.1109/IPDPS.2019.00083. Online publication date: May-2019
