research-article

Dynamic speculative optimizations for SQL compilation in Apache Spark

Authors:

Filippo Schiavio,

Daniele Bonetta,

Walter Binder

Proceedings of the VLDB Endowment, Volume 13, Issue 5

Pages 754 - 767

https://doi.org/10.14778/3377369.3377382

Published: 01 January 2020

Abstract

Big-data systems have gained significant momentum, and Apache Spark is becoming a de facto standard for modern data analytics. Spark relies on SQL query compilation to optimize the execution performance of analytical workloads on a variety of data sources. Despite its scalable architecture, Spark's SQL code generation suffers from significant runtime overheads related to data access and deserialization. This performance penalty can be significant, especially when applications operate on human-readable data formats such as CSV or JSON.

In this paper, we present a new approach to query compilation that overcomes these limitations by relying on run-time profiling and dynamic code generation. Our new SQL compiler for Spark produces highly efficient machine code, leading to speedups of up to 4.4x on the TPC-H benchmark with textual data formats such as CSV or JSON.
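To make the idea of speculation driven by run-time profiling concrete, the following is a minimal illustrative sketch in Python. It is not the compiler described in the paper: the class `SpeculativeColumnReader`, the function `generic_parse`, and the `warmup_rows` parameter are hypothetical names introduced here for illustration only. The sketch profiles the first few textual fields of a column, speculates that the column contains only integers, installs a specialized parser protected by a guard, and falls back to the generic parser (a "deoptimization") when the guard fails.

```python
from typing import Callable, Union

Value = Union[int, str]


def generic_parse(field: str) -> Value:
    """Slow path: handles any field by trying int and falling back to str."""
    try:
        return int(field)
    except ValueError:
        return field


class SpeculativeColumnReader:
    """Hypothetical reader that specializes itself after profiling a column."""

    def __init__(self, warmup_rows: int = 3) -> None:
        self.warmup_rows = warmup_rows
        self.seen = 0
        self.only_digits = True
        self.deopts = 0
        # The active parser; it is swapped as the speculation changes.
        self.parse: Callable[[str], Value] = self._profiling_parse

    def _profiling_parse(self, field: str) -> Value:
        # Profile: record whether every field so far is a plain digit string.
        self.only_digits = (
            self.only_digits and field.isascii() and field.isdigit()
        )
        self.seen += 1
        if self.seen >= self.warmup_rows:
            # Specialize only if the profile supports the assumption.
            if self.only_digits:
                self.parse = self._speculative_int_parse
            else:
                self.parse = generic_parse
        return generic_parse(field)

    def _speculative_int_parse(self, field: str) -> Value:
        # Guard: the speculation holds only for ASCII digit strings.
        if field.isascii() and field.isdigit():
            return int(field)
        # Guard failed: drop the speculation and use the generic path.
        self.deopts += 1
        self.parse = generic_parse
        return generic_parse(field)


reader = SpeculativeColumnReader(warmup_rows=2)
values = [reader.parse(f) for f in ["1", "22", "333", "abc", "4"]]
```

In Python the specialized path is not actually faster; the sketch only shows the control flow (profile, specialize, guard, fall back). In a dynamically compiling runtime, the guarded fast path is what would be turned into specialized machine code.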

Index Terms

1. Software and its engineering
  1. Software notations and tools

Published In

Proceedings of the VLDB Endowment, Volume 13, Issue 5

January 2020

195 pages

ISSN: 2150-8097

Editors: Magdalena Balazinska (University of Washington), Xiaofang Zhou (University of Queensland, Australia)

Publisher

VLDB Endowment
