Dynamic speculative optimizations for SQL compilation in Apache Spark

Published: 01 January 2020

Abstract

Big-data systems have gained significant momentum, and Apache Spark is becoming a de facto standard for modern data analytics. Spark relies on SQL query compilation to optimize the execution performance of analytical workloads on a variety of data sources. Despite its scalable architecture, Spark's SQL code generation suffers from significant runtime overheads related to data access and deserialization. This performance penalty can be substantial, especially when applications operate on human-readable data formats such as CSV or JSON.
In this paper, we present a new approach to query compilation that overcomes these limitations by relying on run-time profiling and dynamic code generation. Our new SQL compiler for Spark produces highly efficient machine code, leading to speedups of up to 4.4x on the TPC-H benchmark with textual data formats such as CSV or JSON.

    Published In

    Proceedings of the VLDB Endowment, Volume 13, Issue 5
    January 2020
    195 pages
    ISSN: 2150-8097

    Publisher

    VLDB Endowment

    Publication History

    Published: 01 January 2020
    Published in PVLDB Volume 13, Issue 5

    Qualifiers

    • Research-article

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 15
    • Downloads (Last 6 weeks): 1
    Reflects downloads up to 18 Feb 2025

    Citations

    Cited By

    • (2024) Comprehensive Review and Future Research Directions on ICT Standardisation. Information 15:11 (691). DOI: 10.3390/info15110691. Online publication date: 2-Nov-2024
    • (2023) DynQ: a dynamic query engine with query-reuse capabilities embedded in a polyglot runtime. The VLDB Journal: The International Journal on Very Large Data Bases 32:5 (1111-1135). DOI: 10.1007/s00778-023-00784-2. Online publication date: 13-Mar-2023
    • (2022) Intelligent Fashion Design Platform of Clothing Colleges in the Guangdong-Hong Kong-Macao Greater Bay Area based on JAVA and SQL. 2022 4th International Conference on Smart Systems and Inventive Technology (ICSSIT) (1211-1215). DOI: 10.1109/ICSSIT53264.2022.9716570. Online publication date: 20-Jan-2022
    • (2021) Language-agnostic integrated queries in a managed polyglot runtime. Proceedings of the VLDB Endowment 14:8 (1414-1426). DOI: 10.14778/3457390.3457405. Online publication date: 1-Apr-2021
    • (2021) Adaptive code generation for data-intensive analytics. Proceedings of the VLDB Endowment 14:6 (929-942). DOI: 10.14778/3447689.3447697. Online publication date: 1-Feb-2021
    • (2021) Optimising SQL Queries Using Genetic Improvement. 2021 IEEE/ACM International Workshop on Genetic Improvement (GI) (9-10). DOI: 10.1109/GI52543.2021.00010. Online publication date: May-2021
    • (2020) Permutable compiled queries. Proceedings of the VLDB Endowment 14:2 (101-113). DOI: 10.14778/3425879.3425882. Online publication date: 1-Oct-2020
    • (2020) Towards dynamic SQL compilation in Apache Spark. Companion Proceedings of the 4th International Conference on Art, Science, and Engineering of Programming (46-49). DOI: 10.1145/3397537.3397566. Online publication date: 23-Mar-2020
