GPU Database Systems Characterization and Optimization

Published: 01 November 2023

Abstract

GPUs offer massive parallelism and high-bandwidth memory access, making them an attractive option for accelerating data analytics in database systems. However, while modern GPUs possess more resources than ever before (e.g., higher DRAM bandwidth), efficient system implementations and judicious resource allocations for query processing are still necessary for optimal performance. Database systems can save GPU runtime costs through just-enough resource allocation or improve query throughput with concurrent query processing by leveraging new GPU resource-allocation capabilities, such as Multi-Instance GPU (MIG).
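As a concrete illustration of the MIG capability mentioned above, the following shell sketch partitions a GPU into isolated instances that can each run queries concurrently. The commands follow NVIDIA's `nvidia-smi` MIG interface; the profile ID is a placeholder, since available profiles depend on the GPU model.

```shell
# Enable MIG mode on GPU 0 (requires root; the GPU may need a reset afterwards).
sudo nvidia-smi -i 0 -mig 1

# List the GPU-instance profiles this GPU supports, with their profile IDs.
nvidia-smi mig -lgip

# Create two GPU instances (profile ID taken from the listing above),
# together with their default compute instances (-C).
sudo nvidia-smi mig -cgi <profile-id>,<profile-id> -C

# Verify the resulting MIG devices; each appears as a separate CUDA device
# that a database process can be pinned to.
nvidia-smi -L
```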
In this paper, we conduct a cross-stack performance and resource-utilization analysis of four GPU database systems, including Crystal (the state-of-the-art GPU database, performance-wise) and TQP (the latest entrant in the GPU database space). We evaluate each system's bottlenecks through an in-depth microarchitectural study and identify resource underutilization by leveraging the classic roofline model. Based on the insights gained from our investigation, we propose optimizations for both system implementation and resource allocation, with which we achieve 1.9x lower latency for single-query execution and up to 6.5x higher throughput for concurrent query execution.
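To make the roofline analysis concrete, here is a minimal Python sketch of the classic roofline bound (attainable performance = min(peak compute, bandwidth x arithmetic intensity)). The peak-compute and bandwidth figures are illustrative placeholders (roughly FP32 numbers for an NVIDIA A100), not measurements from the paper.

```python
# Minimal sketch of the classic roofline model used for identifying
# resource underutilization. Peak numbers are assumed placeholders.

PEAK_COMPUTE_GFLOPS = 19_500.0  # assumed FP32 peak throughput (GFLOP/s)
PEAK_DRAM_BW_GBS = 1_555.0      # assumed DRAM bandwidth (GB/s)

def attainable_gflops(arithmetic_intensity: float) -> float:
    """Roofline bound for a kernel.

    arithmetic_intensity is in FLOPs per byte moved to/from DRAM;
    the kernel cannot exceed either the compute roof or the memory roof.
    """
    return min(PEAK_COMPUTE_GFLOPS, PEAK_DRAM_BW_GBS * arithmetic_intensity)

# The ridge point separates memory-bound from compute-bound kernels.
ridge = PEAK_COMPUTE_GFLOPS / PEAK_DRAM_BW_GBS

# A scan/aggregate-style query kernel doing ~1 FLOP per 4-byte column value
# has intensity ~0.25 FLOP/byte -- deep in the memory-bound region, so its
# bound is set by DRAM bandwidth, not compute.
scan_bound = attainable_gflops(0.25)
```

Plotting `attainable_gflops` against intensity on log-log axes gives the familiar roofline chart; a kernel sitting well below its roof signals the kind of underutilization the paper's analysis targets.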


Cited By

  • Accelerating GPU Data Processing using FastLanes Compression. In Proceedings of the 20th International Workshop on Data Management on New Hardware (DaMoN 2024), 1-11. DOI: 10.1145/3662010.3663450. Published 10 June 2024.
  • How Does Software Prefetching Work on GPU Query Processing? In Proceedings of the 20th International Workshop on Data Management on New Hardware (DaMoN 2024), 1-9. DOI: 10.1145/3662010.3663445. Published 10 June 2024.
  • Heterogeneous Intra-Pipeline Device-Parallel Aggregations. In Proceedings of the 20th International Workshop on Data Management on New Hardware (DaMoN 2024), 1-10. DOI: 10.1145/3662010.3663441. Published 10 June 2024.
  • FPGA Based Database Sort-Aggregation Query Acceleration Architecture. In 2024 9th International Symposium on Computer and Information Processing Technology (ISCIPT), 204-208. DOI: 10.1109/ISCIPT61983.2024.10672642. Published 24 May 2024.

Published In

Proceedings of the VLDB Endowment, Volume 17, Issue 3
November 2023, 353 pages
ISSN: 2150-8097

Publisher

VLDB Endowment

Qualifiers

  • Research-article

Article Metrics

  • Downloads (last 12 months): 663
  • Downloads (last 6 weeks): 48
Reflects downloads up to 29 January 2025.
