short-paper

Efficient Query Processing in Python Using Compilation

Authors:

Hesam Shahrokhi,

Callum Groeger,

Amir Shaikhha

SIGMOD '23: Companion of the 2023 International Conference on Management of Data

Pages 199 - 202

https://doi.org/10.1145/3555041.3589735

Published: 05 June 2023

Abstract

In this paper, we present a framework for efficient query processing in Python. Motivated by data scientists' growing interest in Python-based frameworks such as TensorFlow and Pandas, our framework supports three input languages. The first is SQL: to better integrate SQL queries with the rest of the data science pipeline, the framework relies on an off-the-shelf query optimizer (e.g., PostgreSQL) to translate the SQL code into a physical query plan, which is in turn translated into Pandas code. The second is Pandas code, which can either be run by Pandas itself or be translated into SDQL.py. The third is SDQL.py, which can be translated into efficient low-level code and can achieve an order-of-magnitude performance improvement over Pandas. Our framework exposes a Python-based API that allows data scientists to use SDQL.py as a pure Python library.
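As a rough illustration of the idea behind the pipeline in the abstract, the sketch below evaluates one small aggregation query twice: once as SQL through an off-the-shelf engine, and once as a single loop that sums values into a dictionary. Everything in it is an assumption made for illustration: the `orders` table, its rows, and the helper names `run_sql` and `run_dict_sum` are hypothetical, SQLite stands in for the engine only because it ships with Python, and the dictionary loop is plain Python, not the SDQL.py API or the code the framework generates.

```python
import sqlite3
from collections import defaultdict

# Hypothetical toy relation (customer, amount, status); not taken from the paper.
ORDERS = [
    ("alice", 30.0, "shipped"),
    ("bob", 12.5, "shipped"),
    ("alice", 7.5, "pending"),
    ("bob", 20.0, "shipped"),
    ("carol", 5.0, "pending"),
]

QUERY = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE status = 'shipped'
    GROUP BY customer
"""


def run_sql(rows):
    """Evaluate QUERY with an off-the-shelf engine (in-memory SQLite)."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE orders (customer TEXT, amount REAL, status TEXT)")
        con.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
        return dict(con.execute(QUERY).fetchall())
    finally:
        con.close()


def run_dict_sum(rows):
    """Evaluate the same query as one pass that sums into a dictionary."""
    totals = defaultdict(float)
    for customer, amount, status in rows:
        if status == "shipped":         # corresponds to the WHERE clause
            totals[customer] += amount  # corresponds to GROUP BY + SUM
    return dict(totals)


sql_result = run_sql(ORDERS)
loop_result = run_dict_sum(ORDERS)
print(sql_result == loop_result)
```

The loop form makes the filter and the grouped sum explicit in a single pass over the data, which is the kind of shape that lends itself to translation into low-level code; how the actual framework performs that translation is described in the paper, not here.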

Supplemental Material

MP4 File (26.73 MB)

In this video, we present the demo paper "Efficient Query Processing in Python Using Compilation." It demonstrates converting SQL code to Pandas, SDQL.py, and ultimately C++. We also show how these transformations improve query processing efficiency and execution speed.


Cited By

Jungmair M, Engelke A, Giceva J (2024) HiPy: Extracting High-Level Semantics from Python Code for Data Processing. Proceedings of the ACM on Programming Languages 8:OOPSLA2 (736-762). Online publication date: 8-Oct-2024
https://dl.acm.org/doi/10.1145/3689737
Chasialis K, Palaiologou T, Foufoulas Y, Simitsis A, Ioannidis Y (2024) QFusor: A UDF Optimizer Plugin for SQL Databases. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (5457-5460). Online publication date: 13-May-2024
https://doi.org/10.1109/ICDE60146.2024.00427
Shahrokhi H, Kaboli A, Ghorbani M, Shaikhha A (2024) PyTond: Efficient Python Data Science on the Shoulders of Databases. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (423-435). Online publication date: 13-May-2024
https://doi.org/10.1109/ICDE60146.2024.00039

Index Terms

Efficient Query Processing in Python Using Compilation
1. Information systems
  1. Data management systems
    1. Query languages
      1. Relational database query languages


Published In

SIGMOD '23: Companion of the 2023 International Conference on Management of Data

June 2023

330 pages

ISBN: 9781450395076

DOI: 10.1145/3555041

General Chairs: Sudipto Das (Amazon Web Services, USA), Ippokratis Pandis (Amazon Web Services, USA)

Program Chairs: K. Selçuk Candan (Arizona State University, USA), Sihem Amer-Yahia (CNRS, Université Grenoble Alpes, France)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States



Qualifiers

Short-paper

Data Availability

The demo video described under Supplemental Material is available at https://dl.acm.org/doi/10.1145/3555041.3589735#SIGMOD23-modde72.mp4

Conference

SIGMOD/PODS '23

Sponsor:

SIGMOD

SIGMOD/PODS '23: International Conference on Management of Data

June 18 - 23, 2023

Seattle, WA, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics

Total Citations: 3

Total Downloads: 236

Downloads (Last 12 months): 96

Downloads (Last 6 weeks): 8

Reflects downloads up to 25 Jan 2025

