research-article

Open access

Heterogeneous Intra-Pipeline Device-Parallel Aggregations

Authors:

Artem Kroviakov,

Christoph Anneser,

Jana Giceva

DaMoN '24: Proceedings of the 20th International Workshop on Data Management on New Hardware

Article No.: 3, Pages 1 - 10

https://doi.org/10.1145/3662010.3663441

Published: 09 June 2024

Abstract

Rising hardware heterogeneity in modern systems opens new dimensions for optimizing task execution in data processing frameworks. Specialized hardware is often expected to be the exclusive executor of a particular workload, either because it was designed for that workload or because it is simply the fastest option. Heterogeneous database systems almost always consider offloading entire operations. However, little attention has been given to database systems with horizontal cross-device pipeline parallelization. We argue that such an approach can be applied to systems with morsel-driven parallelism and can improve their performance. We apply our parallelization strategy to an existing system and accelerate aggregations using two devices by up to 1.5x compared to the fastest exclusive device executor.
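
To make the idea of sharing one aggregation pipeline between two devices more concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes that the input is split into fixed-size morsels, that two worker threads stand in for the two devices, and that each worker builds a private partial SUM aggregate that is merged at the end. The names `MORSEL_SIZE`, `device_worker`, `aggregate_on_two_devices`, and the device labels are hypothetical.

```python
import threading
from collections import Counter
from queue import Empty, Queue

MORSEL_SIZE = 4  # hypothetical morsel size (rows per work unit)


def make_morsels(rows, size=MORSEL_SIZE):
    """Split the input into fixed-size morsels and place them in a shared queue."""
    queue = Queue()
    for start in range(0, len(rows), size):
        queue.put(rows[start:start + size])
    return queue


def device_worker(morsels, partial):
    """Stand-in for one device: pull morsels until none remain and
    fold them into a device-private partial result (SUM per key)."""
    while True:
        try:
            morsel = morsels.get_nowait()
        except Empty:
            return
        for key, value in morsel:
            partial[key] += value


def aggregate_on_two_devices(rows):
    """Run one aggregation pipeline on two simulated devices, then merge."""
    morsels = make_morsels(rows)
    # Hypothetical device labels; real devices would run compiled kernels.
    partials = {"cpu": Counter(), "gpu": Counter()}
    threads = [
        threading.Thread(target=device_worker, args=(morsels, partial))
        for partial in partials.values()
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    # Final merge step: combine the per-device partial aggregates.
    merged = Counter()
    for partial in partials.values():
        merged.update(partial)
    return dict(merged)


# Example input: (group key, value) pairs with three groups.
rows = [(i % 3, i) for i in range(20)]
result = aggregate_on_two_devices(rows)
```

Because both workers pull from the same queue, a faster device naturally processes more morsels, and the merged result does not depend on how the morsels happen to be distributed. A real system would additionally have to account for data transfers and device-specific code generation, which this sketch leaves out.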

Index Terms

Heterogeneous Intra-Pipeline Device-Parallel Aggregations
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Heterogeneous (hybrid) systems
2. Information systems
  1. Data management systems
    1. Database management system engines

Published In

DaMoN '24: Proceedings of the 20th International Workshop on Data Management on New Hardware

June 2024

123 pages

ISBN:9798400706677

DOI:10.1145/3662010

Editors:
Carsten Binnig (TU Darmstadt, Germany),
Nesime Tatbul (Intel Labs and MIT, USA)

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

SIGMOD/PODS '24

Sponsor:

SIGMOD

SIGMOD/PODS '24: International Conference on Management of Data

June 10, 2024

AA, Santiago, Chile

Acceptance Rates

DaMoN '24 Paper Acceptance Rate 14 of 25 submissions, 56%;

Overall Acceptance Rate 94 of 127 submissions, 74%

Bibliometrics

Total Citations: 0
Total Downloads: 282
Downloads (Last 12 months): 282
Downloads (Last 6 weeks): 34

Reflects downloads up to 16 Jan 2025
