research-article

GPU multisplit

Authors: Saman Ashkiani, Andrew Davidson, John D. Owens

PPoPP '16: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Article No.: 12, Pages 1 - 13

https://doi.org/10.1145/2851141.2851169

Published: 27 February 2016

Abstract

Multisplit is a broadly useful parallel primitive that permutes its input data into contiguous buckets or bins, where the function that categorizes an element into a bucket is provided by the programmer. Due to the lack of an efficient multisplit on GPUs, programmers often choose to implement multisplit with a sort. However, sort does more work than necessary to implement multisplit, and is thus inefficient. In this work, we provide a parallel model and multiple implementations for the multisplit problem. Our principal focus is multisplit for a small number of buckets. In our implementations, we exploit the computational hierarchy of the GPU to perform most of the work locally, with minimal usage of global operations. We also use warp-synchronous programming models to avoid branch divergence and reduce memory usage, as well as hierarchical reordering of input elements to achieve better coalescing of global memory accesses. On an NVIDIA K40c GPU, for key-only (key-value) multisplit, we demonstrate a 3.0-6.7x (4.4-8.0x) speedup over radix sort, and achieve a peak throughput of 10.0 G keys/s.
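To make the primitive concrete, the following is a minimal sequential sketch of what a multisplit computes. It is not the GPU implementation described in the paper. It assumes a stable formulation built from three steps (per-bucket histogram, exclusive prefix sum, scatter), and the names `multisplit`, `MultisplitResult`, and `bucket_of` are hypothetical, chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: the permuted keys plus the start offset of each
// bucket (offsets[b] is where bucket b begins; offsets.back() == keys.size()).
template <typename Key>
struct MultisplitResult {
    std::vector<Key> keys;
    std::vector<std::size_t> offsets;
};

// Sequential reference multisplit (illustrative only, not the paper's GPU code).
// `bucket_of` is the programmer-supplied function that maps a key to a bucket
// id in [0, num_buckets). Elements keep their input order within each bucket.
template <typename Key, typename BucketFn>
MultisplitResult<Key> multisplit(const std::vector<Key>& keys,
                                 std::size_t num_buckets,
                                 BucketFn bucket_of) {
    // Step 1: histogram. offsets[b + 1] temporarily holds the size of bucket b.
    std::vector<std::size_t> offsets(num_buckets + 1, 0);
    for (const Key& k : keys) {
        const std::size_t b = bucket_of(k);
        assert(b < num_buckets && "bucket_of must return an id in [0, num_buckets)");
        ++offsets[b + 1];
    }

    // Step 2: running sum turns bucket sizes into bucket start offsets.
    for (std::size_t b = 0; b < num_buckets; ++b) {
        offsets[b + 1] += offsets[b];
    }

    // Step 3: scatter each key to the next free slot of its bucket.
    std::vector<Key> out(keys.size());
    std::vector<std::size_t> cursor(offsets.begin(), offsets.end() - 1);
    for (const Key& k : keys) {
        out[cursor[bucket_of(k)]++] = k;
    }

    return {out, offsets};
}
```

For example, splitting the keys `{7, 2, 9, 4, 1, 8}` into three buckets with `bucket_of(k) = k % 3` yields `{9, 7, 4, 1, 2, 8}` with bucket offsets `{0, 1, 4, 6}`. Elements are grouped by bucket but not ordered within a bucket by value, which is why a full sort does more work than the primitive requires.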

Published In

PPoPP '16: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 2016

420 pages

ISBN: 9781450340922

DOI: 10.1145/2851141

General Chair: Rafael Asenjo, University of Málaga, Spain

Program Chair: Tim Harris, Oracle Labs, Cambridge, UK

ACM SIGPLAN Notices, Volume 51, Issue 8 (PPoPP '16)

August 2016

405 pages

ISSN: 0362-1340

EISSN: 1558-1160

DOI: 10.1145/3016078

Editor: Matthew Fluet

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

DFG
Sandia LDRD
MADALGO (Center for Massive Data Algorithmics)
UC Lab Fees Research Program Award
NSF

Conference

PPoPP '16: 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

March 12 - 16, 2016

Barcelona, Spain

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%
