research-article

On sorting and load balancing on GPUs

Authors:

Daniel Cederman,

Philippas Tsigas

ACM SIGARCH Computer Architecture News, Volume 36, Issue 5

Pages 11 - 18

https://doi.org/10.1145/1556444.1556447

Published: 20 June 2009

Abstract

In this paper we take a look at GPU-Quicksort, an efficient Quicksort algorithm suitable for the highly parallel multi-core graphics processors. Quicksort had previously been considered an inefficient sorting solution for graphics processors, but GPU-Quicksort often performs better than the fastest known sorting implementations for graphics processors, such as radix and bitonic sort. Quicksort can thus be seen as a viable alternative for sorting large quantities of data on graphics processors.

We also take a look at a comparison of different load balancing schemes. To get maximum performance on many-core graphics processors, it is important to balance the workload evenly so that all processing units contribute equally to the task at hand. This can be hard to achieve when the cost of a task is not known beforehand and when new sub-tasks are created dynamically during execution. With the recent advent of scatter operations and atomic hardware primitives, it is now possible to bring some of the more elaborate dynamic load balancing schemes from the conventional SMP systems domain to the graphics processor domain.
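As a rough illustration of why atomic primitives matter for dynamic load balancing, the sketch below lets workers claim tasks of uneven cost from a shared counter using an atomic fetch-and-add. It is a CPU-side C++ analogy under stated assumptions, not one of the schemes compared in the paper: `Task`, `run_dynamic`, and the simulated work loop are hypothetical names introduced here, and the sketch does not cover sub-tasks created during execution.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical task: "cost" stands in for work whose duration is unknown up front.
struct Task { int cost; };

// Each worker claims the next unclaimed task index with an atomic fetch-and-add,
// so workers that finish early simply come back for more work.
// Returns the total cost processed by each worker.
std::vector<long> run_dynamic(const std::vector<Task>& tasks, unsigned num_workers) {
    std::atomic<std::size_t> next{0};             // shared index of the next free task
    std::vector<long> processed(num_workers, 0);  // one slot per worker, no sharing
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < num_workers; ++w) {
        workers.emplace_back([&, w] {
            for (;;) {
                std::size_t i = next.fetch_add(1, std::memory_order_relaxed);
                if (i >= tasks.size()) break;     // no tasks left to claim
                volatile long sink = 0;           // simulated work proportional to cost
                for (int k = 0; k < tasks[i].cost; ++k) sink = sink + k;
                processed[w] += tasks[i].cost;
            }
        });
    }
    for (auto& t : workers) t.join();
    return processed;
}
```

Every task is claimed exactly once, so the per-worker totals always sum to the total cost, while the split between workers depends on how quickly each one finishes its tasks.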

Published In

ACM SIGARCH Computer Architecture News Volume 36, Issue 5

December 2008

111 pages

ISSN:0163-5964

DOI:10.1145/1556444

Publisher

Association for Computing Machinery

New York, NY, United States
