GPU Database Systems Characterization and Optimization

Published: 01 November 2023

Abstract

GPUs offer massive parallelism and high-bandwidth memory access, making them an attractive option for accelerating data analytics in database systems. However, while modern GPUs possess more resources than ever before (e.g., higher DRAM bandwidth), efficient system implementations and judicious resource allocations for query processing are still necessary for optimal performance. Database systems can save GPU runtime costs through just-enough resource allocation or improve query throughput with concurrent query processing by leveraging new GPU resource-allocation capabilities, such as Multi-Instance GPU (MIG).
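As a concrete illustration of the MIG capability mentioned above, the following shell sketch partitions a GPU into isolated instances that can each run queries concurrently. The commands follow NVIDIA's `nvidia-smi` MIG interface; the profile ID is a placeholder, since available profiles depend on the GPU model.

```shell
# Enable MIG mode on GPU 0 (requires root; the GPU may need a reset afterwards).
sudo nvidia-smi -i 0 -mig 1

# List the GPU-instance profiles this GPU supports, with their profile IDs.
nvidia-smi mig -lgip

# Create two GPU instances (profile ID taken from the listing above),
# together with their default compute instances (-C).
sudo nvidia-smi mig -cgi <profile-id>,<profile-id> -C

# Verify the resulting MIG devices; each appears as a separate CUDA device
# that a database process can be pinned to.
nvidia-smi -L
```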
In this paper, we conduct a cross-stack performance and resource-utilization analysis of four GPU database systems, including Crystal (the state-of-the-art GPU database, performance-wise) and TQP (the latest entrant in the GPU database space). We evaluate each system's bottlenecks through an in-depth microarchitectural study and identify resource underutilization by leveraging the classic roofline model. Based on the insights gained from our investigation, we propose optimizations for both system implementation and resource allocation, with which we achieve 1.9x lower latency for single-query execution and up to 6.5x higher throughput for concurrent query execution.
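To make the roofline analysis concrete, here is a minimal Python sketch of the classic roofline bound (attainable performance = min(peak compute, bandwidth x arithmetic intensity)). The peak-compute and bandwidth figures are illustrative placeholders (roughly FP32 numbers for an NVIDIA A100), not measurements from the paper.

```python
# Minimal sketch of the classic roofline model used for identifying
# resource underutilization. Peak numbers are assumed placeholders.

PEAK_COMPUTE_GFLOPS = 19_500.0  # assumed FP32 peak throughput (GFLOP/s)
PEAK_DRAM_BW_GBS = 1_555.0      # assumed DRAM bandwidth (GB/s)

def attainable_gflops(arithmetic_intensity: float) -> float:
    """Roofline bound for a kernel.

    arithmetic_intensity is in FLOPs per byte moved to/from DRAM;
    the kernel cannot exceed either the compute roof or the memory roof.
    """
    return min(PEAK_COMPUTE_GFLOPS, PEAK_DRAM_BW_GBS * arithmetic_intensity)

# The ridge point separates memory-bound from compute-bound kernels.
ridge = PEAK_COMPUTE_GFLOPS / PEAK_DRAM_BW_GBS

# A scan/aggregate-style query kernel doing ~1 FLOP per 4-byte column value
# has intensity ~0.25 FLOP/byte -- deep in the memory-bound region, so its
# bound is set by DRAM bandwidth, not compute.
scan_bound = attainable_gflops(0.25)
```

Plotting `attainable_gflops` against intensity on log-log axes gives the familiar roofline chart; a kernel sitting well below its roof signals the kind of underutilization the paper's analysis targets.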


Cited By

  • Accelerating GPU Data Processing using FastLanes Compression. In Proceedings of the 20th International Workshop on Data Management on New Hardware (DaMoN 2024), 1-11. DOI: 10.1145/3662010.3663450. Published 10 June 2024.
  • How Does Software Prefetching Work on GPU Query Processing? In Proceedings of the 20th International Workshop on Data Management on New Hardware (DaMoN 2024), 1-9. DOI: 10.1145/3662010.3663445. Published 10 June 2024.
  • Heterogeneous Intra-Pipeline Device-Parallel Aggregations. In Proceedings of the 20th International Workshop on Data Management on New Hardware (DaMoN 2024), 1-10. DOI: 10.1145/3662010.3663441. Published 10 June 2024.
  • FPGA Based Database Sort-Aggregation Query Acceleration Architecture. In 2024 9th International Symposium on Computer and Information Processing Technology (ISCIPT), 204-208. DOI: 10.1109/ISCIPT61983.2024.10672642. Published 24 May 2024.

Published In

Proceedings of the VLDB Endowment, Volume 17, Issue 3
November 2023, 353 pages
ISSN: 2150-8097

Publisher

VLDB Endowment

Qualifiers

  • Research-article

Article Metrics

  • Downloads (last 12 months): 663
  • Downloads (last 6 weeks): 48
Reflects downloads up to 29 January 2025.
