Fast Equi-Join Algorithms on GPUs: Design and Implementation

Authors:

Yi-Cheng Tu

SSDBM '17: Proceedings of the 29th International Conference on Scientific and Statistical Database Management

Article No.: 17, Pages 1 - 12

https://doi.org/10.1145/3085504.3085521

Published: 27 June 2017

Abstract

Processing relational joins on modern GPUs has attracted much attention in the past few years. With the rapid development of the hardware and software environment in the GPU world, existing GPU join algorithms designed for earlier architectures cannot make the most of the latest GPU products. In this paper, we report a new design and implementation of high-performance join algorithms for today's GPGPU environment. This work is a key component of our scientific database engine named G-SDMS. In particular, we overhaul the popular radix hash join and redesign the sort-merge join algorithm on GPUs by applying a series of novel techniques that exploit the hardware capacity of the latest Nvidia GPU architecture and new features of the CUDA programming framework. Our algorithms take advantage of the revised hardware arrangement, larger register file and shared memory, native atomic operations, dynamic parallelism, and CUDA Streams. Experiments show that our new hash join algorithm is 2.0 to 14.6 times as efficient as the existing GPU implementation, while the new sort-merge join achieves a speedup of 4.0X to 4.9X. Compared to the best CPU sort-merge join and hash join known to date, our optimized code achieves up to 10.5X and 5.5X speedup, respectively. Moreover, we extend our design to scenarios where large data tables cannot fit in the GPU memory.
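For readers unfamiliar with the radix hash join named in the abstract, the following is a minimal sequential sketch of the general technique: partition both tables on the low bits of the join key (histogram, prefix sum, scatter), then build and probe a small hash table per partition. It is not the paper's GPU implementation. The function names `radix_partition` and `radix_hash_join`, the `bits` parameter, and the `(key, payload)` row layout are assumptions made for illustration, and keys are assumed to be non-negative integers.

```python
from collections import defaultdict


def radix_partition(rows, bits):
    """Group (key, payload) rows by the low `bits` bits of the key.

    Returns the reordered rows and an offsets list such that partition p
    occupies out[offsets[p]:offsets[p + 1]].
    """
    fanout = 1 << bits
    mask = fanout - 1

    # Pass 1: histogram of partition sizes.
    hist = [0] * fanout
    for key, _ in rows:
        hist[key & mask] += 1

    # Exclusive prefix sum turns sizes into start offsets.
    offsets = [0] * (fanout + 1)
    for p in range(fanout):
        offsets[p + 1] = offsets[p] + hist[p]

    # Pass 2: scatter each row to the next free slot of its partition.
    out = [None] * len(rows)
    cursor = offsets[:-1].copy()
    for row in rows:
        p = row[0] & mask
        out[cursor[p]] = row
        cursor[p] += 1
    return out, offsets


def radix_hash_join(r, s, bits=4):
    """Equi-join r and s on the key; returns (key, r_payload, s_payload) tuples."""
    r_part, r_off = radix_partition(r, bits)
    s_part, s_off = radix_partition(s, bits)

    result = []
    for p in range(1 << bits):
        # Build: hash table over partition p of r.
        table = defaultdict(list)
        for key, payload in r_part[r_off[p]:r_off[p + 1]]:
            table[key].append(payload)
        # Probe: look up every row of partition p of s.
        for key, payload in s_part[s_off[p]:s_off[p + 1]]:
            for r_payload in table.get(key, ()):
                result.append((key, r_payload, payload))
    return result


# Hypothetical sample data: keys 1 and 17 share partition 1 when bits=4.
r = [(1, "a"), (2, "b"), (17, "c"), (2, "d")]
s = [(2, "x"), (17, "y"), (3, "z"), (2, "w")]
matches = radix_hash_join(r, s, bits=4)
```

Partitioning first keeps each per-partition hash table small; on a GPU the motivation would be to fit each partition in fast on-chip memory, with the histogram and scatter passes being natural places for atomic operations. That mapping is a general observation about the technique rather than a description of the paper's design.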

Information

Published In

SSDBM '17: Proceedings of the 29th International Conference on Scientific and Statistical Database Management

June 2017

373 pages

ISBN:9781450352826

DOI:10.1145/3085504

General Chair:
Alok Choudhary,
Program Chair:
Kesheng Wu,
Publications Chair:
Bin Dong

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

Northwestern University

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

National Science Foundation

Conference

SSDBM '17

SSDBM '17: 29th International Conference on Scientific and Statistical Database Management

June 27 - 29, 2017

Chicago, IL, USA

Acceptance Rates

Overall Acceptance Rate 56 of 146 submissions, 38%
