DOI: 10.1145/2851141.2851169 (ACM Conferences, PPoPP Conference Proceedings; research article)

GPU multisplit

Published: 27 February 2016

Abstract

Multisplit is a broadly useful parallel primitive that permutes its input data into contiguous buckets or bins, where the function that categorizes an element into a bucket is provided by the programmer. Due to the lack of an efficient multisplit on GPUs, programmers often choose to implement multisplit with a sort. However, sort does more work than necessary to implement multisplit, and is thus inefficient. In this work, we provide a parallel model and multiple implementations for the multisplit problem. Our principal focus is multisplit for a small number of buckets. In our implementations, we exploit the computational hierarchy of the GPU to perform most of the work locally, with minimal usage of global operations. We also use warp-synchronous programming models to avoid branch divergence and reduce memory usage, as well as hierarchical reordering of input elements to achieve better coalescing of global memory accesses. On an NVIDIA K40c GPU, for key-only (key-value) multisplit, we demonstrate a 3.0-6.7x (4.4-8.0x) speedup over radix sort, and achieve a peak throughput of 10.0 G keys/s.
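
To make the definition in the abstract concrete, here is a minimal sequential sketch of what a multisplit computes. It is not the paper's GPU implementation: it only illustrates the input/output contract using the common histogram, exclusive prefix sum, and scatter pattern. The names `multisplit`, `bucket_of`, and `num_buckets` are hypothetical and chosen for illustration. This sketch happens to keep elements in input order within each bucket, which the abstract does not state as a requirement.

```python
from typing import Callable, List, Sequence, Tuple


def multisplit(
    keys: Sequence[int],
    bucket_of: Callable[[int], int],
    num_buckets: int,
) -> Tuple[List[int], List[int]]:
    """Permute `keys` into contiguous buckets (sequential reference sketch).

    `bucket_of` is the programmer-supplied function mapping a key to a
    bucket index in [0, num_buckets). Returns the permuted keys and the
    starting offset of each bucket in the output.
    """
    # Step 1: histogram, counting how many keys fall into each bucket.
    counts = [0] * num_buckets
    for k in keys:
        counts[bucket_of(k)] += 1

    # Step 2: exclusive prefix sum over the counts gives bucket start offsets.
    offsets = []
    total = 0
    for c in counts:
        offsets.append(total)
        total += c

    # Step 3: scatter each key to the next free slot of its bucket.
    cursor = list(offsets)
    out = [0] * len(keys)
    for k in keys:
        b = bucket_of(k)
        out[cursor[b]] = k
        cursor[b] += 1

    return out, offsets


if __name__ == "__main__":
    keys = [7, 2, 9, 4, 1, 8, 3]
    result, starts = multisplit(keys, lambda k: k % 3, 3)
    print(result)  # [9, 3, 7, 4, 1, 2, 8]
    print(starts)  # [0, 2, 5]
```

A stable sort on the bucket index, `sorted(keys, key=bucket_of)`, produces the same permutation, but it orders elements by a full comparison of bucket indices rather than by a single counting pass. That gap is the extra work the abstract refers to when it says sort does more than multisplit needs.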

Published In

PPoPP '16: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2016
420 pages
ISBN:9781450340922
DOI:10.1145/2851141

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Funding Sources

  • DFG
  • Sandia LDRD
  • MADALGO (Center for Massive Data Algorithmics)
  • UC Lab Fees Research Program Award
  • NSF

Conference

PPoPP '16

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Cited By

  • (2022) General-purpose GPU Hashing Data Structures and their Application in Accelerated Genomics. Journal of Parallel and Distributed Computing. DOI: 10.1016/j.jpdc.2022.01.006. Online publication date: Feb-2022.
  • (2021) A fast work-efficient SSSP algorithm for GPUs. Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 133-146. DOI: 10.1145/3437801.3441605. Online publication date: 17-Feb-2021.
  • (2020) WarpCore: A Library for fast Hash Tables on GPUs. 2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC), pp. 11-20. DOI: 10.1109/HiPC50609.2020.00015. Online publication date: Dec-2020.
  • (2019) Gossip. Proceedings of the 48th International Conference on Parallel Processing, pp. 1-10. DOI: 10.1145/3337821.3337889. Online publication date: 5-Aug-2019.
  • (2017) Gunrock. ACM Transactions on Parallel Computing, 4:1, pp. 1-49. DOI: 10.1145/3108140. Online publication date: 23-Aug-2017.
  • (2017) GPU Multisplit. ACM Transactions on Parallel Computing, 4:1, pp. 1-44. DOI: 10.1145/3108139. Online publication date: 23-Aug-2017.
  • (2017) A Memory Bandwidth-Efficient Hybrid Radix Sort on GPUs. Proceedings of the 2017 ACM International Conference on Management of Data, pp. 417-432. DOI: 10.1145/3035918.3064043. Online publication date: 9-May-2017.
  • (2016) Parallel Transposition of Sparse Data Structures. Proceedings of the 2016 International Conference on Supercomputing, pp. 1-13. DOI: 10.1145/2925426.2926291. Online publication date: 1-Jun-2016.
  • (2016) Fast Multiplication in Binary Fields on GPUs via Register Cache. Proceedings of the 2016 International Conference on Supercomputing, pp. 1-12. DOI: 10.1145/2925426.2926259. Online publication date: 1-Jun-2016.
  • (2020) A radix sorting parallel algorithm suitable for graphic processing unit computing. Concurrency and Computation: Practice and Experience, 33:6. DOI: 10.1002/cpe.5818. Online publication date: 30-Sep-2020.