DOI:10.1145/3555041.3589735
short-paper

Efficient Query Processing in Python Using Compilation

Published: 05 June 2023

Abstract

In this paper, we present a framework for efficient query processing in Python. Motivated by data scientists' growing interest in Python-based frameworks such as TensorFlow and Pandas, our framework accepts three input languages. The first is SQL: to better integrate SQL queries with the rest of the data science pipeline, the SQL code is translated into a physical query plan by an off-the-shelf query optimizer (e.g., PostgreSQL), and the plan is in turn translated into Pandas code. The second is Pandas code, which can either be run by Pandas itself or be translated into SDQL.py. The third is SDQL.py, which can be translated into efficient low-level code and can achieve an order-of-magnitude performance improvement over Pandas. Our framework exposes a Python-based API that allows data scientists to use SDQL.py as a pure Python library.
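
To make the idea of lowering a declarative query into loop-based code more concrete, the following is a minimal sketch and not the framework's API. It assumes a hypothetical toy table `orders`, uses the standard-library `sqlite3` module as a stand-in SQL engine (not PostgreSQL), and uses a hand-written function, `run_fused_loop`, to stand in for the kind of dictionary-accumulating code a query compiler might emit. All table, column, and function names here are illustrative assumptions.

```python
import sqlite3
from collections import defaultdict

# Hypothetical toy table: (customer, amount) rows.
ORDERS = [
    ("alice", 10.0),
    ("bob", 5.0),
    ("alice", 7.5),
    ("carol", 2.0),
    ("bob", 1.0),
]

# A filter followed by a group-by aggregation.
QUERY = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount > 1.5
    GROUP BY customer
"""


def run_sql(rows):
    """Evaluate QUERY with SQLite (a stand-in for a real SQL engine)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        return dict(conn.execute(QUERY).fetchall())
    finally:
        conn.close()


def run_fused_loop(rows):
    """Hand-written equivalent of QUERY: the filter and the group-by sum
    are fused into one pass that accumulates into a dictionary."""
    totals = defaultdict(float)
    for customer, amount in rows:
        if amount > 1.5:
            totals[customer] += amount
    return dict(totals)


sql_result = run_sql(ORDERS)
loop_result = run_fused_loop(ORDERS)
```

Both functions compute the same per-customer totals. The loop version avoids materializing the filtered intermediate result, which is one reason compiled, fused code can outperform operator-at-a-time execution.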

Supplemental Material

MP4 File
In this video, we present the demo paper "Efficient Query Processing in Python Using Compilation." It covers the demonstration of converting SQL code to Pandas, SDQL.py, and ultimately C++. We also show how the previous transformations improve query processing efficiency and execution speed.

Cited By

  • (2024) HiPy: Extracting High-Level Semantics from Python Code for Data Processing. Proceedings of the ACM on Programming Languages 8:OOPSLA2 (736-762). DOI:10.1145/3689737. Online publication date: 8-Oct-2024
  • (2024) QFusor: A UDF Optimizer Plugin for SQL Databases. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (5457-5460). DOI:10.1109/ICDE60146.2024.00427. Online publication date: 13-May-2024
  • (2024) PyTond: Efficient Python Data Science on the Shoulders of Databases. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (423-435). DOI:10.1109/ICDE60146.2024.00039. Online publication date: 13-May-2024

    Published In

    SIGMOD '23: Companion of the 2023 International Conference on Management of Data
    June 2023
    330 pages
    ISBN:9781450395076
    DOI:10.1145/3555041
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. SQL
    2. data science
    3. dataframe
    4. python
    5. query compilation

    Qualifiers

    • Short-paper

    Data Availability

    Demo video (MP4): https://dl.acm.org/doi/10.1145/3555041.3589735#SIGMOD23-modde72.mp4

    Conference

    SIGMOD/PODS '23
    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 96
    • Downloads (Last 6 weeks): 8
    Reflects downloads up to 25 Jan 2025
