Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

Authors:

Jesús Camacho-Rodríguez,

Michael J. Mior,

Daniel Lemire

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

Pages 221 - 230

https://doi.org/10.1145/3183713.3190662

Published: 27 May 2018

Abstract

Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. The goal of this paper is to formally introduce Calcite to the broader research community, briefly present its history, and describe its architecture, features, functionality, and patterns for adoption. Calcite's architecture consists of a modular and extensible query optimizer with hundreds of built-in optimization rules, a query processor capable of processing a variety of query languages, an adapter architecture designed for extensibility, and support for heterogeneous data models and stores (relational, semi-structured, streaming, and geospatial). This flexible, embeddable, and extensible architecture is what makes Calcite an attractive choice for adoption in big-data frameworks. It is an active project that continues to introduce support for new types of data sources, query languages, and approaches to query processing and optimization.

Index Terms

Information systems > Data management systems > Database management system engines > DBMS engine architectures

Published In

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

May 2018

1874 pages

ISBN: 9781450347037

DOI: 10.1145/3183713

General Chairs: Gautam Das (University of Texas at Arlington, USA), Christopher Jermaine (Rice University, USA), Philip Bernstein (Microsoft Research, USA)

Copyright © 2018 ACM.

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

U.S. Department of Energy

Conference

SIGMOD/PODS '18

Sponsor:

SIGMOD

SIGMOD/PODS '18: International Conference on Management of Data

June 10 - 15, 2018

Houston, TX, USA

Acceptance Rates

SIGMOD '18 Paper Acceptance Rate 90 of 461 submissions, 20%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics

Total citations: 91

Total downloads: 3,867

Downloads (last 12 months): 890

Downloads (last 6 weeks): 159

Reflects downloads up to 25 Dec 2024

Cited By

Santhosh Gourishetti (2024). Performance Optimization in Distributed SQL Environments: A Comprehensive Analysis of Presto Query Engine. International Journal of Scientific Research in Computer Science, Engineering and Information Technology 10:6 (241-253). Online publication date: 8-Nov-2024.
https://doi.org/10.32628/CSEIT24106173

Wang S, Pan S, Cheung A (2024). QED: A Powerful Query Equivalence Decider for SQL. Proceedings of the VLDB Endowment 17:11 (3602-3614). Online publication date: 1-Jul-2024.
https://dl.acm.org/doi/10.14778/3681954.3682024

Srivastava T, Fernandez R (2024). Saving Money for Analytical Workloads in the Cloud. Proceedings of the VLDB Endowment 17:11 (3524-3537). Online publication date: 1-Jul-2024.
https://dl.acm.org/doi/10.14778/3681954.3682018

Zhang W, Lim W, Butrovich M, Pavlo A (2024). The Holon Approach for Simultaneously Tuning Multiple Components in a Self-Driving Database Management System with Machine Learning via Synthesized Proto-Actions. Proceedings of the VLDB Endowment 17:11 (3373-3387). Online publication date: 30-Aug-2024.
https://doi.org/10.14778/3681954.3682007

Zhou X, Li G, Sun Z, Liu Z, Chen W, Wu J, Liu J, Feng R, Zeng G (2024). D-Bot: Database Diagnosis System using Large Language Models. Proceedings of the VLDB Endowment 17:10 (2514-2527). Online publication date: 1-Jun-2024.
https://dl.acm.org/doi/10.14778/3675034.3675043

Poepsel-Lemaitre R, Beedkar K, Markl V (2024). Disclosure-Compliant Query Answering. Proceedings of the ACM on Management of Data 2:6 (1-28). Online publication date: 20-Dec-2024.
https://dl.acm.org/doi/10.1145/3698808

Hu Y, Gilad A, Stephens-Martinez K, Roy S, Yang J (2024). Qr-Hint: Actionable Hints Towards Correcting Wrong SQL Queries. Proceedings of the ACM on Management of Data 2:3 (1-27). Online publication date: 30-May-2024.
https://dl.acm.org/doi/10.1145/3654995

Siddiqui T, Wu W (2024). ML-Powered Index Tuning: An Overview of Recent Progress and Open Challenges. ACM SIGMOD Record 52:4 (19-30). Online publication date: 19-Jan-2024.
https://dl.acm.org/doi/10.1145/3641832.3641836

Huang H, Siddiqui T, Alotaibi R, Curino C, Leeka J, Jindal A, Zhao J, Camacho-Rodríguez J, Tian Y (2024). Sibyl: Forecasting Time-Evolving Query Workloads. Proceedings of the ACM on Management of Data 2:1 (1-27). Online publication date: 26-Mar-2024.
https://dl.acm.org/doi/10.1145/3639308

Pradhan A, Karthik S, S. R (2024). Optimal Query Plans for Geo-distributed Data Analytics at Scale. Proceedings of the 7th Joint International Conference on Data Science & Management of Data (11th ACM IKDD CODS and 29th COMAD) (247-251). Online publication date: 4-Jan-2024.
https://dl.acm.org/doi/10.1145/3632410.3632424