DOI: 10.1145/3183713.3190662
research-article
Public Access

Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

Published: 27 May 2018

Abstract

Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. The goal of this paper is to formally introduce Calcite to the broader research community, briefly present its history, and describe its architecture, features, functionality, and patterns for adoption. Calcite's architecture consists of a modular and extensible query optimizer with hundreds of built-in optimization rules, a query processor capable of processing a variety of query languages, an adapter architecture designed for extensibility, and support for heterogeneous data models and stores (relational, semi-structured, streaming, and geospatial). This flexible, embeddable, and extensible architecture is what makes Calcite an attractive choice for adoption in big-data frameworks. It is an active project that continues to introduce support for the new types of data sources, query languages, and approaches to query processing and optimization.

    Published In

    SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data
    May 2018
    1874 pages
    ISBN: 9781450347037
    DOI: 10.1145/3183713
    Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. apache calcite
    2. data management
    3. modular query optimization
    4. query algebra
    5. relational semantics
    6. storage adapters

    Qualifiers

    • Research-article

    Funding Sources

    • U.S. Department of Energy

    Conference

    SIGMOD/PODS '18

    Acceptance Rates

    SIGMOD '18 Paper Acceptance Rate 90 of 461 submissions, 20%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%
