research-article

On sorting and load balancing on GPUs

Published: 20 June 2009

Abstract

In this paper we take a look at GPU-Quicksort, an efficient Quicksort algorithm suitable for highly parallel multi-core graphics processors. Quicksort had previously been considered an inefficient sorting solution for graphics processors, but GPU-Quicksort often performs better than the fastest known sorting implementations for graphics processors, such as radix and bitonic sort. Quicksort can thus be seen as a viable alternative for sorting large quantities of data on graphics processors.
We also compare different load balancing schemes. To get maximum performance on many-core graphics processors, it is important to balance the workload evenly so that all processing units contribute equally to the task at hand. This can be hard to achieve when the cost of a task is not known beforehand and when new sub-tasks are created dynamically during execution. With the recent advent of scatter operations and atomic hardware primitives, it is now possible to bring some of the more elaborate dynamic load balancing schemes from the conventional SMP systems domain to the graphics processor domain.

Information & Contributors

Information

Published In

ACM SIGARCH Computer Architecture News  Volume 36, Issue 5
December 2008
111 pages
ISSN:0163-5964
DOI:10.1145/1556444

Publisher

Association for Computing Machinery

New York, NY, United States

Citations

Cited By

  • (2024) Fast interactive simulations of cardiac electrical activity in anatomically accurate heart structures by compressing sparse uniform cartesian grids. Computer Methods and Programs in Biomedicine 257 (108456). DOI: 10.1016/j.cmpb.2024.108456. Online publication date: Dec-2024
  • (2018) A Study of Integer Sorting on Multicores. Parallel Processing Letters 28:04 (1850014). DOI: 10.1142/S0129626418500147. Online publication date: 19-Dec-2018
  • (2018) A comparison-free sorting algorithm on CPUs and GPUs. The Journal of Supercomputing 74:11 (6369-6400). DOI: 10.1007/s11227-018-2567-3. Online publication date: 1-Nov-2018
  • (2017) GPU sorting algorithms. Advances in GPU Research and Practice (307-326). DOI: 10.1016/B978-0-12-803738-6.00012-4. Online publication date: 2017
  • (2015) Efficient GPU Spatial-Temporal Multitasking. IEEE Transactions on Parallel and Distributed Systems 26:3 (748-760). DOI: 10.1109/TPDS.2014.2313342. Online publication date: Mar-2015
  • (2014) Parallel Simulation of Pore Networks Using Multicore CPUs. IEEE Transactions on Computers 63:6 (1513-1525). DOI: 10.1109/TC.2012.197. Online publication date: 1-Jun-2014
  • (2014) The All‐Pair Shortest‐Path Problem in Shared‐Memory Heterogeneous Systems. High‐Performance Computing on Complex Environments (283-299). DOI: 10.1002/9781118711897.ch15. Online publication date: 18-Apr-2014
  • (2013) StreamScan. ACM SIGPLAN Notices 48:8 (229-238). DOI: 10.1145/2517327.2442539. Online publication date: 23-Feb-2013
  • (2013) StreamScan. Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming (229-238). DOI: 10.1145/2442516.2442539. Online publication date: 23-Feb-2013
  • (2013) Dynamic Task Parallelism with a GPU Work-Stealing Runtime System. Languages and Compilers for Parallel Computing (203-217). DOI: 10.1007/978-3-642-36036-7_14. Online publication date: 2013
