DOI: 10.1145/3662010.3663441
research-article
Open access

Heterogeneous Intra-Pipeline Device-Parallel Aggregations

Published: 09 June 2024

Abstract

The rising hardware heterogeneity of modern systems opens new dimensions for optimizing task execution in data processing frameworks. Specialized hardware is often expected to be the exclusive executor of a particular workload, either because it was designed for that workload or because it is simply the fastest option. Heterogeneous database systems almost always consider offloading an entire operation to a single device. Little attention, however, has been given to database systems with horizontal cross-device pipeline parallelization. We argue that such an approach can be applied to systems with morsel-driven parallelism and can improve performance. We apply our parallelization strategy to an existing system and accelerate aggregations using two devices by up to 1.5x compared to the fastest exclusive device executor.
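
To make the idea of sharing one aggregation pipeline between devices more concrete, the following is a minimal sketch and not the system evaluated in the paper. It assumes that two hypothetical devices (labelled "cpu" and "gpu", each simulated here by an ordinary Python thread) pull fixed-size morsels from a shared queue, build private partial aggregates, and that the partial results are merged at the end. All function names, the device labels, and the morsel size are illustrative assumptions.

```python
import queue
import threading
from collections import Counter


def make_morsels(rows, morsel_size):
    """Split the input rows into fixed-size chunks (morsels)."""
    return [rows[i:i + morsel_size] for i in range(0, len(rows), morsel_size)]


def worker(morsels, partial):
    """Pull morsels until the shared queue is empty, aggregating locally.

    Each worker owns its own `partial` table, so no locking is needed
    while aggregating; only the queue is shared.
    """
    while True:
        try:
            morsel = morsels.get_nowait()
        except queue.Empty:
            return
        for key, value in morsel:
            partial[key] += value


def device_parallel_sum(rows, devices=("cpu", "gpu"), morsel_size=4):
    """Grouped SUM in which several (simulated) devices share one pipeline."""
    morsels = queue.Queue()
    for morsel in make_morsels(rows, morsel_size):
        morsels.put(morsel)

    # One private partial aggregate per hypothetical device.
    partials = {name: Counter() for name in devices}
    threads = [
        threading.Thread(target=worker, args=(morsels, partials[name]))
        for name in devices
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Merge step: combine the per-device partial aggregates.
    result = Counter()
    for partial in partials.values():
        result.update(partial)
    return dict(result)


if __name__ == "__main__":
    rows = [(key, i) for i, key in enumerate("abcabcabcabc")]
    print(device_parallel_sum(rows))
```

Because every worker takes the next morsel as soon as it finishes the previous one, a faster device naturally processes a larger share of the input. The merged result does not depend on which device handled which morsel.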



Published In

DaMoN '24: Proceedings of the 20th International Workshop on Data Management on New Hardware
June 2024
123 pages
ISBN:9798400706677
DOI:10.1145/3662010
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Query engine
  2. dedicated GPUs
  3. heterogeneous query processing

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SIGMOD/PODS '24
Acceptance Rates

DaMoN '24 Paper Acceptance Rate 14 of 25 submissions, 56%;
Overall Acceptance Rate 94 of 127 submissions, 74%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 282
  • Downloads (Last 12 months): 282
  • Downloads (Last 6 weeks): 34

Reflects downloads up to 16 Jan 2025
