research-article

VectorH: Taking SQL-on-Hadoop to the Next Level

Authors: Adrian Ionescu, Bogdan Răducanu, Michał Switakowski, Cristian Bârca, Juliusz Sompolski, Alicja Łuszczak, Michał Szafrański, Peter Boncz

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

Pages 1105 - 1117

https://doi.org/10.1145/2882903.2903742

Published: 14 June 2016

Abstract

Actian Vector in Hadoop (VectorH for short) is a new SQL-on-Hadoop system built on top of the fast Vectorwise analytical database system. VectorH achieves fault tolerance and storage scalability by relying on HDFS, and extends the state-of-the-art in SQL-on-Hadoop systems by instrumenting the HDFS replication policy to optimize read locality. VectorH integrates with YARN for workload management, achieving a high degree of elasticity. Even though HDFS is an append-only filesystem, and VectorH supports (update-averse) ordered tables, trickle updates are possible thanks to Positional Delta Trees (PDTs), a differential update structure that can be queried efficiently. We describe the changes made to single-server Vectorwise to turn it into a Hadoop-based MPP system, encompassing workload management, parallel query optimization and execution, HDFS storage, transaction processing and Spark integration. We evaluate VectorH against HAWQ, Impala, SparkSQL and Hive, showing orders of magnitude better performance.
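To make the idea of a differential update structure concrete, here is a minimal sketch and not VectorH's actual implementation. It assumes an immutable, ordered base table (as it would be on an append-only filesystem) and a small set of position-tagged inserts, deletes, and modifications that are merged in while scanning. A flat sorted list stands in for the tree; the names `Delta` and `merge_scan` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Iterator, List


@dataclass(frozen=True)
class Delta:
    """One pending update, addressed by a position in the immutable base table."""
    pos: int           # base-table position the update refers to
    kind: str          # "ins" = insert before pos, "del" = drop row at pos, "mod" = replace row at pos
    value: Any = None  # new row for "ins" and "mod"; unused for "del"


def merge_scan(base: List[Any], deltas: List[Delta]) -> Iterator[Any]:
    """Yield the up-to-date table image without ever rewriting `base`."""
    # Stable sort: deltas at the same position keep the order they were given in.
    pending = sorted(deltas, key=lambda d: d.pos)
    i = 0
    for pos, row in enumerate(base):
        deleted = False
        while i < len(pending) and pending[i].pos == pos:
            d = pending[i]
            i += 1
            if d.kind == "ins":
                yield d.value      # new row appears before the base row at pos
            elif d.kind == "del":
                deleted = True     # base row at pos is suppressed
            elif d.kind == "mod":
                row = d.value      # base row at pos is replaced
            else:
                raise ValueError(f"unknown delta kind: {d.kind!r}")
        if not deleted:
            yield row
    # Inserts positioned at or past the end of the base table are appended.
    for d in pending[i:]:
        if d.kind != "ins":
            raise ValueError("only inserts may point past the end of the base table")
        yield d.value


base_table = [10, 20, 30, 40]
updates = [
    Delta(1, "ins", 15),
    Delta(2, "del"),
    Delta(3, "mod", 45),
    Delta(4, "ins", 50),
]
current = list(merge_scan(base_table, updates))
```

In this sketch the base table is only read, never changed, so small updates cost no rewrite of the ordered data; the merge work is paid at scan time instead.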

Index Terms

VectorH: Taking SQL-on-Hadoop to the Next Level
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
      2. Parallel and distributed DBMSs

Published In

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

June 2016

2300 pages

ISBN: 9781450335317

DOI: 10.1145/2882903

General Chairs: Fatma Özcan (IBM Research, USA), Georgia Koutrika (HP Labs, USA)

Program Chair: Sam Madden (Massachusetts Institute of Technology, USA)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS'16: International Conference on Management of Data

June 26 - July 1, 2016

San Francisco, California, USA

Sponsor: SIGMOD

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
