research-article
DOI: 10.1145/2882903.2903742

VectorH: Taking SQL-on-Hadoop to the Next Level

Published: 14 June 2016

    Abstract

    Actian Vector in Hadoop (VectorH for short) is a new SQL-on-Hadoop system built on top of the fast Vectorwise analytical database system. VectorH achieves fault tolerance and storage scalability by relying on HDFS, and extends the state-of-the-art in SQL-on-Hadoop systems by instrumenting the HDFS replication policy to optimize read locality. VectorH integrates with YARN for workload management, achieving a high degree of elasticity. Even though HDFS is an append-only filesystem, and VectorH supports (update-averse) ordered tables, trickle updates are possible thanks to Positional Delta Trees (PDTs), a differential update structure that can be queried efficiently. We describe the changes made to single-server Vectorwise to turn it into a Hadoop-based MPP system, encompassing workload management, parallel query optimization and execution, HDFS storage, transaction processing and Spark integration. We evaluate VectorH against HAWQ, Impala, SparkSQL and Hive, showing orders of magnitude better performance.
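
The idea of layering trickle updates over immutable, ordered storage can be illustrated with a much-simplified sketch. This is not the PDT data structure from the paper: it uses plain dictionaries keyed by stable row position instead of a tree, and every name in it (`PositionalDeltas`, `merge_scan`, `insert_before`) is a hypothetical chosen for illustration. It only shows how a scan might merge position-addressed inserts, deletes, and modifications into the stable data without rewriting it.

```python
from dataclasses import dataclass, field


@dataclass
class PositionalDeltas:
    """Hypothetical, simplified stand-in for a positional delta structure.

    All updates are addressed by position in the immutable stable table.
    """
    inserts: dict = field(default_factory=dict)   # position -> rows inserted before it
    deletes: set = field(default_factory=set)     # positions of deleted stable rows
    modifies: dict = field(default_factory=dict)  # position -> replacement row

    def insert_before(self, pos, row):
        self.inserts.setdefault(pos, []).append(row)

    def delete(self, pos):
        self.deletes.add(pos)

    def modify(self, pos, row):
        self.modifies[pos] = row


def merge_scan(stable, deltas):
    """Yield the current table image; the stable rows are never rewritten."""
    # Iterate one past the end so rows inserted at the tail are emitted too.
    for pos in range(len(stable) + 1):
        yield from deltas.inserts.get(pos, [])
        if pos < len(stable) and pos not in deltas.deletes:
            yield deltas.modifies.get(pos, stable[pos])


stable = [10, 20, 30, 40]          # stands in for an append-only ordered table
deltas = PositionalDeltas()
deltas.insert_before(1, 15)        # keeps the table ordered
deltas.delete(2)                   # hides the stable row 30
deltas.modify(3, 45)               # replaces the stable row 40
deltas.insert_before(4, 50)        # append at the tail
current = list(merge_scan(stable, deltas))
```

Because deltas are addressed by position rather than by key value, the merge in this sketch needs no key comparisons during the scan; whether and how the real PDT exploits that is described in the paper, not here.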

    Published In

    SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
    June 2016
    2300 pages
    ISBN: 9781450335317
    DOI: 10.1145/2882903

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. SQL-on-Hadoop
    2. column stores
    3. distributed systems
    4. parallel database systems
    5. query optimization
    6. query processing

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS'16: International Conference on Management of Data
    June 26 - July 1, 2016
    San Francisco, California, USA

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%
